A cute Python implementation of itertools.tee

Just for fun - no complaint, no suggestion, just sharing a bit of code that tickled me.

The docs for `itertools.tee()` contain a Python work-alike, which is easy to follow. It gives each derived generator its own deque, and when a new value is obtained from the original iterator it pushes that value onto each of those deques.

Of course it's possible for them to share a single deque, but the code gets more complicated. Is it possible to make it simpler instead?

What it "really" needs is a shared singly-linked list of values, pointing from oldest value to newest. Then each derived generator can just follow the links, and yield its next result in time independent of the number of derived generators. But how do you know when a new value needs to be obtained from the original iterator, and how do you know when space for an older value can be recycled (because all of the derived generators have yielded it)?

I ended up with almost a full page of code to do that, storing with each value and link a count of the number of derived generators that had yet to yield the value, effectively coding my own reference-count scheme by hand, along with "head" and "tail" pointers to the ends of the linked list that proved remarkably tricky to keep correct in all cases.

Then I thought "this is stupid! Python already does reference counting." Voila! Vast swaths of tedious code vanished, giving this remarkably simple implementation:

    def mytee(xs, n):
        last = [None, None]

        def gen(it, mylast):
            nonlocal last
            while True:
                mylast = mylast[1]
                if not mylast:
                    mylast = last[1] = last = [next(it), None]
                yield mylast[0]

        it = iter(xs)
        return tuple(gen(it, last) for _ in range(n))

There's no need to keep a pointer to the start of the shared list at all - we only need a pointer to the end of the list ("last"), and each derived generator only needs a pointer to its own current position in the list ("mylast").

What I find kind of hilarious is that it's no help at all as a prototype for a C implementation: Python recycles stale `[next(it), None]` pairs all by itself, when their internal refcounts fall to 0. That's the hardest part.

BTW, I certainly don't suggest adding this to the itertools docs either. While it's short and elegant, it's too subtle to grasp easily - if you think "it's obvious", you haven't yet thought hard enough about the problem ;-)
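[To make the behaviour concrete, a small usage sketch - my example, not part of Tim's post. It sticks to explicit next() calls, since running the derived generators to exhaustion on Python 3.7+ trips over the StopIteration issue raised in the next message:]

    a, b = mytee(range(100), 2)
    assert next(a) == 0 and next(a) == 1   # 'a' runs ahead, growing the shared list
    assert next(b) == 0                    # 'b' follows at its own pace
    assert next(a) == 2 and next(b) == 1   # stale nodes are freed by refcounting alone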

On Sun, 15 Apr 2018 00:05:58 -0500 Tim Peters <tim.peters@gmail.com> wrote:
This implementation doesn't work with Python 3.7 or 3.8. I've tried it here: https://gist.github.com/pitrou/b3991f638300edb6d06b5be23a4c66d6 and get:

    Traceback (most recent call last):
      File "mytee.py", line 14, in gen
        mylast = last[1] = last = [next(it), None]
    StopIteration

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "mytee.py", line 47, in <module>
        run(mytee1)
      File "mytee.py", line 36, in run
        lists[i].append(next(iters[i]))
    RuntimeError: generator raised StopIteration

(Yuck!)

In short, you want the following instead:

    try:
        mylast = last[1] = last = [next(it), None]
    except StopIteration:
        return
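[The failure is PEP 479 at work: since Python 3.7, a StopIteration that escapes a generator body is converted into a RuntimeError. A minimal repro, my example, independent of mytee:]

    def leaky(it):
        while True:
            yield next(it)   # StopIteration escapes the generator body on exhaustion

    g = leaky(iter([]))
    try:
        next(g)
    except RuntimeError as exc:
        print(exc)           # on 3.7+: "generator raised StopIteration"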
That's smart and obscure :-o The way it works is that the `last` assignment changes the `last` value seen by all derived generators, while the `last[1]` assignment updates the bindings made in the other generators' `mylast` lists... It's difficult to find the words to explain it.

The chained assignment makes it more difficult to parse as well: when I read this I don't know whether `last[1]` or `last` gets assigned first (apparently the answer is `last[1]`, otherwise the recipe wouldn't work correctly). Perhaps like this:

    while True:
        mylast = mylast[1]
        if not mylast:
            try:
                # Create new list link
                mylast = [next(it), None]
            except StopIteration:
                return
            else:
                # Append to other generators' `mylast` linked lists
                last[1] = mylast
                # Update shared list link
                last = last[1]
        yield mylast[0]

Regards

Antoine.

[Antoine Pitrou <solipsis@pitrou.net>]
Thanks for trying!

I wonder whether that will break other code. I wrote PEP 255, and this part was intentional at the time:

    """
    If an unhandled exception-- including, but not limited to,
    StopIteration --is raised by, OR PASSES THROUGH [emphasis added], a
    generator function, then the exception is passed on to the caller in
    the usual way, and subsequent attempts to resume the generator
    function raise StopIteration.
    """

I've exploited that a number of times.
No, I don't ;-) If I have to catch StopIteration myself now, then I want the entire "while True:" loop in the "try" block. Setting up try/except machinery anew on each iteration would add significant overhead; doing it just once per derived generator wouldn't.
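[A sketch of that arrangement - my assembly of the code from the first message, not something Tim posted in this form. The try/except is set up once per derived generator, wrapping the whole loop, so nothing is added to the per-iteration path:]

    def mytee(xs, n):
        last = [None, None]

        def gen(it, mylast):
            nonlocal last
            try:   # entered once per derived generator, not once per iteration
                while True:
                    mylast = mylast[1]
                    if not mylast:
                        mylast = last[1] = last = [next(it), None]
                    yield mylast[0]
            except StopIteration:
                return   # convert into a normal generator exit (PEP 479-safe)

        it = iter(xs)
        return tuple(gen(it, last) for _ in range(n))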
Which is why I didn't even try - I did warn people that if they thought it "was obvious", they hadn't yet thought hard enough ;-) Good job!
Ya, I had to look it up too :-) Although, like almost everything else in Python, chained assignments proceed "left to right". I was just trying to make it as short as possible, to increase the "huh - can something that tiny really work?!" healthy skepticism factor :-)
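[A tiny demonstration of that ordering - my example. In a chained assignment the targets are bound left to right, so a subscript target can still refer to the *old* binding of a name that is rebound later in the chain, which is exactly what `mylast = last[1] = last = [next(it), None]` relies on:]

    node = [None]
    old = node
    a = node[0] = node = "new"   # targets bound left to right: a, node[0], node
    assert a == "new"
    assert old == ["new"]        # node[0] mutated the old list...
    assert node == "new"         # ...before the name 'node' was rebound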
I certainly agree that's easier to follow. But that wasn't really the point ;-)

15.04.18 19:52, Tim Peters wrote:
This overhead is around 10% of the time for calling `next(it)`. It may be less than 1-2% of the whole step of mytee iteration. I have ideas about implementing zero-overhead try/except, but I have doubts that it is worth it. The benefit seems too small.
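[For anyone who wants to sanity-check numbers like that, a rough micro-benchmark - my sketch; the exact figures depend on Python version and machine - comparing a bare next() against one wrapped in its own try/except:]

    import itertools
    import timeit

    it1 = itertools.repeat(None)   # never exhausts, so only the call is measured
    it2 = itertools.repeat(None)

    def bare():
        next(it1)

    def wrapped():
        try:
            next(it2)
        except StopIteration:
            pass

    print(timeit.timeit(bare, number=1_000_000))
    print(timeit.timeit(wrapped, number=1_000_000))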

On Sun, Apr 15, 2018 at 08:35:51PM +0300, Serhiy Storchaka wrote:
I have ideas about implementing zero-overhead try/except, but I have doubts that it is worth it. The benefit seems too small.
It is conventional wisdom that catching exceptions is expensive, and that in performance critical code it is better to "look before you leap" if possible, and avoid try...except. Are you saying this advice is obsolete? If not, then perhaps reducing the overhead of catching exceptions may be worthwhile. -- Steve
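[The usual illustration of the two styles - my example, not Steven's:]

    d = {"spam": 1}

    # LBYL ("look before you leap"): test, then act.
    if "eggs" in d:
        value = d["eggs"]
    else:
        value = None

    # EAFP ("easier to ask forgiveness than permission"): act, handle the failure.
    try:
        value = d["eggs"]
    except KeyError:
        value = None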

On Sun, Apr 15, 2018 at 8:05 AM, Tim Peters <tim.peters@gmail.com> wrote: [...]
Things here remind me of my implementation design for PEP 555: the "contexts" present in the process are represented by a singly-linked tree of assignment objects. It's definitely possible to write the above in a more readable way, and FWIW I don't think it involves "assignments as expressions".
Why can't the C implementation use Python refcounts? Are you talking about standalone C code? Or perhaps you are thinking about overhead? (In PEP 555 that was not a concern, though.) Surely it would make sense to reuse the refcounting code that's already there. There are no cycles here, so it's not particularly complicated -- just duplication. Anyway, the whole linked list is unnecessary if the iterable can be iterated over multiple times. But "tee" won't know when to do that. *That* is what I call overhead (unless of course all the tee branches are consumed in an interleaved manner).
--
+ Koos Zevenhoven + http://twitter.com/k7hoven +

On Mon, Apr 16, 2018 at 6:46 AM, Koos Zevenhoven <k7hoven@gmail.com> wrote:
But if you have something you can iterate over multiple times, why bother with tee at all? Just take N iterators from the underlying iterable. The overhead is intrinsic to the value of the function.

ChrisA
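[In code, the distinction looks like this - my illustration: a re-iterable gives you independent passes for free, while a one-shot iterator is exactly the case tee exists for:]

    data = [1, 2, 3]
    its = [iter(data) for _ in range(3)]     # N independent passes, no buffering
    assert [next(it) for it in its] == [1, 1, 1]

    one_shot = (x * x for x in data)         # a generator allows only one pass
    assert list(one_shot) == [1, 4, 9]
    assert list(one_shot) == []              # exhausted; this is where tee helps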

On Sun, Apr 15, 2018 at 11:55 PM, Chris Angelico <rosuav@gmail.com> wrote:

The function that's handed the iterable can't know whether it needs the additional buffer or not. It could be a range object or a set or a generator (or iterator), who knows. Even your type checker doesn't know what you need.

-- Koos

--
+ Koos Zevenhoven + http://twitter.com/k7hoven +

[Koos Zevenhoven <k7hoven@gmail.com>]
Of course it is. The point was brevity and speed, not readability. It was presented partly as a puzzle :-)
Why can't the C implementation use Python refcounts? Are you talking about standalone C code?
Yes, expressing the algorithm in plain old C, not building on top of (say) the Python C API.
Or perhaps you are thinking about overhead?
Nope.
If the latter were how iterables always worked, there would be no need for tee() at all. It's tee's _purpose_ to make it possible for multiple consumers to traverse an iterable's can't-restart-or-even-go-back result sequence each at their own pace.

On Mon, Apr 16, 2018 at 12:06 AM, Tim Peters <tim.peters@gmail.com> wrote:
There must have been a reason why pseudo code was "invented".
Yes. (I'm not sure which is easier, going back or starting from the beginning)

-- Koos

--
+ Koos Zevenhoven + http://twitter.com/k7hoven +

I posted this several weeks ago, just for fun - an obscure but surprisingly brief Python implementation of itertools.tee, sharing a single stream (as a singly linked list) among all the returned iterables.

Didn't think about it again until today, when recent discussions of lexical scoping made me wonder "hmm - when's the last time I even used nonlocal?". Well, this was. And shortly thereafter, "but there's no need to keep track of `last` at all!" I have no idea where that thought came from :-)
So, in fact, the inner function there can be replaced by the even briefer:

    def gen(it, mylast):
        while True:
            if mylast[1] is None:
                mylast[1] = [next(it), None]
            mylast = mylast[1]
            yield mylast[0]

To make it more relevant to current discussions, collapsing the last two lines using a binding expression:

    yield (mylast := mylast[1])[0]

isn't even slightly tempting. Most of the cases where it would be _possible_ to use binding expressions don't strike me as being _sensible_ places to use them - slamming together conceptually different tasks just because it's possible to do so. But because name bindings are so very common, that still leaves plenty where the transformation leaves the code clearer to my eyes (usually just a little, rarely a lot). There are no such cases in the code above.
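[For completeness, a minimal full mytee built around that briefer inner function - my assembly of the pieces from this thread, with Antoine's StopIteration guard folded in; not something Tim posted in this form:]

    def mytee(xs, n):
        it = iter(xs)
        # One shared sentinel node; the unique node whose link field is
        # None *is* the end of the list, so no 'last' pointer is needed.
        start = [None, None]

        def gen(it, mylast):
            while True:
                if mylast[1] is None:
                    try:
                        mylast[1] = [next(it), None]
                    except StopIteration:
                        return   # PEP 479: exit cleanly instead of leaking StopIteration
                mylast = mylast[1]
                yield mylast[0]

        return tuple(gen(it, start) for _ in range(n))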

The topic reminded me of Stephan Houben's streamtee. https://github.com/stephanh42/streamtee/blob/master/streamtee.py

It was an attempt at a memory-efficient tee, but it turned out tee was efficient enough. It uses a thunk-like method and recursion to remove the need for an explicit linked list. In the spirit of the topic, I tried to strip it down to the bare essentials:

    def tee(iterable, n):
        stream = _streamiter(iter(iterable))
        return tuple(_wrapper(stream) for _ in range(n))

    def _streamiter(itr):
        result = (next(itr), _streamiter(itr))
        del itr  # Release unneeded reference.
        while True:
            yield result

    def _wrapper(stream):
        while True:
            result, stream = next(stream)
            yield result

Unfortunately, this code, too, breaks with the new StopIteration propagation rules.
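[Presumably the same kind of guard Antoine added to mytee rescues this version too; a sketch - my adaptation, not from the thread - with both generators converting StopIteration into a clean return:]

    def tee(iterable, n):
        stream = _streamiter(iter(iterable))
        return tuple(_wrapper(stream) for _ in range(n))

    def _streamiter(itr):
        try:
            result = (next(itr), _streamiter(itr))
        except StopIteration:
            return  # underlying iterator exhausted: end this branch of the stream
        del itr  # Release unneeded reference.
        while True:
            yield result

    def _wrapper(stream):
        while True:
            try:
                result, stream = next(stream)
            except StopIteration:
                return  # the stream ended; end this derived iterator too
            yield result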
