Aid reiteration with new class: gfic

Backgound: Most functions that take an iterable as parameter iterate just once. Users of the function can pass any iterable, including iterators. The function will call iter(imput) and go. If the user wants to iterate thru a virtual collection defined by an instance of an iterator class (built-in or user-defined) or generator function, the user must call the gf/ic with the appropriate arguments and pass the result. Problem: Some functions iterate thru the input more than once. If the input is not an iterator, there is no problem: call iter(input) again. If the input is an iterator, that just returns the now-empty iterator. Bad. General solution: The most general solution is for the caller to pass the result of list(it) instead of it. If the caller starts with only an iterator, that is the only solution I know of. Special solution: If the caller starts with an non-iterable iterator source -- iterator class or generator function (which can be regarded as defining a virtual subclass of iterator class 'generator') -- it might be better to package that source and required args as an iterable. class gfic: #untested as yet def __init__(self, func, *args, **kwds): self.func = func self.args = args self.kwds = kwds self.__name__ = func.__name__ def __iter__(self): return self.func(*self.args, **self.kwds) This is similar, I believe, to functools.partial except that the wrapping is total rather than partial and the wrapped call is in __iter__ instead of __call__. Would this be a sensible addition to functools, say, or is the use case too rare? Terry Jan Reedy

On Fri, 19 Jun 2009 07:02:12 am Terry Reedy wrote:
I don't think it's an undue burden on the caller to construct their own iterator before calling your function. To summarise your idea (correct me if I'm wrong): There are three ways of passing iterable-like arguments: (A) pass a sequence object like lists; (B) pass generators or iterators; (C) pass a constructor which returns a generator or iterator, plus appropriate arguments. Insider your function, you can easily iterate over (A) or (C) multiple times, but not (B). I don't think there's any way to treat all three cases identically. Your proposed gfic class will allow you to treat case (C) just like (A), at the cost of expecting the caller to call gfic() rather than pass a constructor directly. But the caller still can't pass an iterator in place of a sequence or constructor, so you haven't solved anything, merely shifted the burden on the user from calling one of: function(list(constructor(*args, **kwargs))) function(list(iterator)) to calling one of: function(gfic(constructor, *args, **kwargs)) function(list(iterator)) As I see it, the correct solution for "my function needs to iterate over an iterable twice" is not to expect the caller to pass a sequence, but to convert the iterable to a sequence inside your function: def function(iterable): # Iterate over iterable twice L = list(iterable) for _ in (1, 2): for x in L: pass Apart from needing to knowing to avoid non-terminating iterators, the user need not know whether you walk the input once or twice. I suppose there is some benefit when dealing with huge iterators, but that's probably best dealt with on an ad hoc basis by requiring the user to pass a constructor directly. -- Steven D'Aprano

Steven D'Aprano wrote:
Only if an iterator is not allowed. Nor do I think it an undue burden either to wrap the constructor, when appropriate, instead of calling it.
Right.
That is precisely the point.
See above
Right. I consider list-avoidance a good thing, more so, it appears than you ;-).
Thank you for your response and analysis. tjr

2009/6/18 Terry Reedy <tjreedy@udel.edu>:
So: The problem is to detect which case you have. In the bad case, call list(input) and iterate over that twice. Which leaves two problems: 1. Detecting the 2 cases. 2. The cost of keeping all of the generated values in a list. For 1, you actually don't need to detect the case of a container, just call list() anyway. Unless the input is so huge that having a second copy is a substantial cost, who cares? For 2, you have 2 choices: 1. Document for callers that you'll be duplicating the input. Then your API isn't usable by people with infinite (or practically infinite) streams - but how much of a problem is this anyway? 2. Design a different API - maybe take explicit callable/arguments parameters. Then people passing a list have to do my_api(iter, L), but you're designing for huge iterators as input, so presumably you don't care about their convenience so much. Actually, I think you *can* detect that you were passed an iterable - if "iter(it) is iter(it)" returns True, then that's when you need to copy[1]. So you could document your API as copying the input if that condition is true. Then, if some user wants to use your function on a huge stream, they know what they are doing, and they can wrap their iterator in a class like your gfic (you could even document this technique in your documentation). So all of this is entirely manageable within your documentation. You could put a sample up as cookbook code, as well, I guess. And honestly, I don't believe that the problem is common enough to warrant anything more - certainly not standard library support. Paul. [1] There may be some edge cases I'm ignoring here.

Paul Moore wrote:
You are looking at the issue from the callee's standpoint. And you have made some good suggestions from that viewpoint. I am looking at the issue from the caller's viewpoint. I have an iterator constructor and I want to use an existing function that requires a re-iterable.
honestly, I don't believe that the problem is common enough to warrant anything more - certainly not standard library support.
The issue is more frequent in Py3, which largely shifts from lists to iterators as the common sequence interchange format. Py3 is what I use and the target of my suggestion. I should have said that. Map and filter, for instance, are now iterator classes, which is to say, iterator constructors, rather than list-returning functions. A call such as somefunc(map(f,l)) which works fine in 2.x will not work in 3.x if somefunc requires a re-iterable. But I agree that it seems to still be too rare for stdlib. I will, however, put the class in the code part of my book-in-progress as 'reiterable'. I am sure that the discussion here will help improve the text discussion of this issue and this class. Terry Jan Reedy

2009/6/20 Terry Reedy <tjreedy@udel.edu>:
OK. But how many functions require a re-iterable? It doesn't seem to me to be a common requirement. And if they did, how often would it be a problem to construct a list? I see your point, but I can't imagine it's a common need. But your experience may differ. Paul.

Paul Moore wrote:
Mostly avoiding this discussion, but I think the use cases are even narrower than that. 1. I've never seen a function parameter spec'ed as iterable vs reiterable. It's always iterable vs sequence. 2. "Sequence" implies a lot more than "reiterable" does: it also implies length, containment test support, random access. So any idea focused solely on reiterability misses out on those extra features. 3. Terry's idea also assumes a deterministic data source. An iterator with a random element or a data feed from the real world can't be used with it (because the original sequence can't be reproduced without saving it). 4. The only reliable way to make an arbitrary iterator reiterable is to cache the results in memory, but we already have a way to do that (i.e. store it in a list or tuple) 5. For iterator based cases where only *some* of the values need to be kept around rather than all of them, then the algorithm implementation should be redesigned so it can make effective use of itertools.tee() In other words, this strikes as a fairly complicated solution to a rather limited problem. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Nick Coghlan wrote:
The latter could be over-spec'ed because people are used to that being the choice, or because 'sequence' is being used as a synonym for 'reiterable' even though it implies more than is needed.
Which may or may not be needed. def pairs(finite_reiterable) for i in finite_reiterable: for j in finite)reiterable: return (i,j) is a simple function for which 'finite_reiterable' is exactly the appropriate spec. The issue is whether it is best to wrap it internally (inline) with a generic list(it) or to let the caller special-case iterator constructors.
Yes.
Which I mentioned in my original post. I specificlly considered how to make iterator constructors, more common in 3.x, reiterable *without* storing everything in memory. Avoiding such storage is why map, filter, range, for example, were changed. So it hardly seems like an oddball aim.
As I understand, everything goes into the tee and everything comes out in order, so I don't understand what you mean keeping some value.
In other words, this strikes as a fairly complicated solution to a rather limited problem.
I believe someone else though it was too simple for the stdlib ;-) The solution is a variation of functools.partial. I would certainly look for more concrete use cases, perhaps in the stdlib, before seriously proposing it for inclusion. Terry Jan Reedy

Terry Reedy wrote:
As I understand, everything goes into the tee and everything comes out in order, so I don't understand what you mean keeping some value.
itertools.tee() knows where each of the spawned iterators is up to in the sequence so it knows when it can throw away old values. In the degenerate case (such as your cartesian product example) it has to store the whole series in memory but it can usually do better than that (based on how far apart the spawned iterators get).
I believe someone else though it was too simple for the stdlib ;-) The solution is a variation of functools.partial.
Approaching it as a variant of functools.partial certainly sounds like the way to go (and rereading your initial post, I realise you have indeed been doing that all along). It will be interesting to see what your search for concrete use cases turns up. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

2009/6/21 Terry Reedy <tjreedy@udel.edu>:
OTOH, it could be because "reiterable" isn't a commonly used concept in Python. It's not entirely clear to me why this is, maybe it's because it's not actually that useful, maybe it's because it's hard to define clearly, or maybe there's another reason. But to some extent this whole thread has the same sort of "solution looking for a problem" feel that earlier threads about reiterability have had. FWIW, the difference between iterators and reiterables is similar to that between C++ input iterators and forward iterators. So it may be worth looking to that in the quest for use cases. However, itertools.tee may cover a number of cases that would need a forward iterator in C++. Paul.

Paul Moore wrote:
My thought is this. iterable and iterator are separate and complementary concepts: a collection (concrete or virtual) that can be iterated over and the thing that does the iterating. For (great) programming convenience, Python iterators were made into a subcategory of an expanded 'iterable' category. In informal discourse, people sometimes, perhaps often, use 'iterable' in the narrower conceptual sense rather than the broader Python-implementation sense. Or they sort of mean it in the narrow sense but it does not matter because the statement also applies to iterators once they are made into iterables. To be clear in a Python context, when one wants to say something that does not apply to iterators, one can qualify 'iterable' to 'non-iterator iterable' or, less commonly at present, use the synonym 'reiterable'. I say synonym because any Python iterable object is reiterable as long as the information defining it is not altered, destroyed, or 'forgotten' by the process of iteration.
It is a solution to a conceptual and didactic problem that I offered for *possible* practical use. But I am not pushing it. Terry Jan Reedy

Terry Reedy wrote:
The issue is more frequent in Py3, which largely shifts from lists to iterators as the common sequence interchange format.
I don't think that's right. As I understand it, things like dict.items() now return *views*, which *are* reiterable (as long as the underlying data doesn't change). -- Greg

On Fri, 19 Jun 2009 07:02:12 am Terry Reedy wrote:
I don't think it's an undue burden on the caller to construct their own iterator before calling your function. To summarise your idea (correct me if I'm wrong): There are three ways of passing iterable-like arguments: (A) pass a sequence object like lists; (B) pass generators or iterators; (C) pass a constructor which returns a generator or iterator, plus appropriate arguments. Insider your function, you can easily iterate over (A) or (C) multiple times, but not (B). I don't think there's any way to treat all three cases identically. Your proposed gfic class will allow you to treat case (C) just like (A), at the cost of expecting the caller to call gfic() rather than pass a constructor directly. But the caller still can't pass an iterator in place of a sequence or constructor, so you haven't solved anything, merely shifted the burden on the user from calling one of: function(list(constructor(*args, **kwargs))) function(list(iterator)) to calling one of: function(gfic(constructor, *args, **kwargs)) function(list(iterator)) As I see it, the correct solution for "my function needs to iterate over an iterable twice" is not to expect the caller to pass a sequence, but to convert the iterable to a sequence inside your function: def function(iterable): # Iterate over iterable twice L = list(iterable) for _ in (1, 2): for x in L: pass Apart from needing to knowing to avoid non-terminating iterators, the user need not know whether you walk the input once or twice. I suppose there is some benefit when dealing with huge iterators, but that's probably best dealt with on an ad hoc basis by requiring the user to pass a constructor directly. -- Steven D'Aprano

Steven D'Aprano wrote:
Only if an iterator is not allowed. Nor do I think it an undue burden either to wrap the constructor, when appropriate, instead of calling it.
Right.
That is precisely the point.
See above
Right. I consider list-avoidance a good thing, more so, it appears than you ;-).
Thank you for your response and analysis. tjr

2009/6/18 Terry Reedy <tjreedy@udel.edu>:
So: The problem is to detect which case you have. In the bad case, call list(input) and iterate over that twice. Which leaves two problems: 1. Detecting the 2 cases. 2. The cost of keeping all of the generated values in a list. For 1, you actually don't need to detect the case of a container, just call list() anyway. Unless the input is so huge that having a second copy is a substantial cost, who cares? For 2, you have 2 choices: 1. Document for callers that you'll be duplicating the input. Then your API isn't usable by people with infinite (or practically infinite) streams - but how much of a problem is this anyway? 2. Design a different API - maybe take explicit callable/arguments parameters. Then people passing a list have to do my_api(iter, L), but you're designing for huge iterators as input, so presumably you don't care about their convenience so much. Actually, I think you *can* detect that you were passed an iterable - if "iter(it) is iter(it)" returns True, then that's when you need to copy[1]. So you could document your API as copying the input if that condition is true. Then, if some user wants to use your function on a huge stream, they know what they are doing, and they can wrap their iterator in a class like your gfic (you could even document this technique in your documentation). So all of this is entirely manageable within your documentation. You could put a sample up as cookbook code, as well, I guess. And honestly, I don't believe that the problem is common enough to warrant anything more - certainly not standard library support. Paul. [1] There may be some edge cases I'm ignoring here.

Paul Moore wrote:
You are looking at the issue from the callee's standpoint. And you have made some good suggestions from that viewpoint. I am looking at the issue from the caller's viewpoint. I have an iterator constructor and I want to use an existing function that requires a re-iterable.
honestly, I don't believe that the problem is common enough to warrant anything more - certainly not standard library support.
The issue is more frequent in Py3, which largely shifts from lists to iterators as the common sequence interchange format. Py3 is what I use and the target of my suggestion. I should have said that. Map and filter, for instance, are now iterator classes, which is to say, iterator constructors, rather than list-returning functions. A call such as somefunc(map(f,l)) which works fine in 2.x will not work in 3.x if somefunc requires a re-iterable. But I agree that it seems to still be too rare for stdlib. I will, however, put the class in the code part of my book-in-progress as 'reiterable'. I am sure that the discussion here will help improve the text discussion of this issue and this class. Terry Jan Reedy

2009/6/20 Terry Reedy <tjreedy@udel.edu>:
OK. But how many functions require a re-iterable? It doesn't seem to me to be a common requirement. And if they did, how often would it be a problem to construct a list? I see your point, but I can't imagine it's a common need. But your experience may differ. Paul.

Paul Moore wrote:
Mostly avoiding this discussion, but I think the use cases are even narrower than that. 1. I've never seen a function parameter spec'ed as iterable vs reiterable. It's always iterable vs sequence. 2. "Sequence" implies a lot more than "reiterable" does: it also implies length, containment test support, random access. So any idea focused solely on reiterability misses out on those extra features. 3. Terry's idea also assumes a deterministic data source. An iterator with a random element or a data feed from the real world can't be used with it (because the original sequence can't be reproduced without saving it). 4. The only reliable way to make an arbitrary iterator reiterable is to cache the results in memory, but we already have a way to do that (i.e. store it in a list or tuple) 5. For iterator based cases where only *some* of the values need to be kept around rather than all of them, then the algorithm implementation should be redesigned so it can make effective use of itertools.tee() In other words, this strikes as a fairly complicated solution to a rather limited problem. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Nick Coghlan wrote:
The latter could be over-spec'ed because people are used to that being the choice, or because 'sequence' is being used as a synonym for 'reiterable' even though it implies more than is needed.
Which may or may not be needed. def pairs(finite_reiterable) for i in finite_reiterable: for j in finite)reiterable: return (i,j) is a simple function for which 'finite_reiterable' is exactly the appropriate spec. The issue is whether it is best to wrap it internally (inline) with a generic list(it) or to let the caller special-case iterator constructors.
Yes.
Which I mentioned in my original post. I specificlly considered how to make iterator constructors, more common in 3.x, reiterable *without* storing everything in memory. Avoiding such storage is why map, filter, range, for example, were changed. So it hardly seems like an oddball aim.
As I understand, everything goes into the tee and everything comes out in order, so I don't understand what you mean keeping some value.
In other words, this strikes as a fairly complicated solution to a rather limited problem.
I believe someone else though it was too simple for the stdlib ;-) The solution is a variation of functools.partial. I would certainly look for more concrete use cases, perhaps in the stdlib, before seriously proposing it for inclusion. Terry Jan Reedy

Terry Reedy wrote:
As I understand, everything goes into the tee and everything comes out in order, so I don't understand what you mean keeping some value.
itertools.tee() knows where each of the spawned iterators is up to in the sequence so it knows when it can throw away old values. In the degenerate case (such as your cartesian product example) it has to store the whole series in memory but it can usually do better than that (based on how far apart the spawned iterators get).
I believe someone else though it was too simple for the stdlib ;-) The solution is a variation of functools.partial.
Approaching it as a variant of functools.partial certainly sounds like the way to go (and rereading your initial post, I realise you have indeed been doing that all along). It will be interesting to see what your search for concrete use cases turns up. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

2009/6/21 Terry Reedy <tjreedy@udel.edu>:
OTOH, it could be because "reiterable" isn't a commonly used concept in Python. It's not entirely clear to me why this is, maybe it's because it's not actually that useful, maybe it's because it's hard to define clearly, or maybe there's another reason. But to some extent this whole thread has the same sort of "solution looking for a problem" feel that earlier threads about reiterability have had. FWIW, the difference between iterators and reiterables is similar to that between C++ input iterators and forward iterators. So it may be worth looking to that in the quest for use cases. However, itertools.tee may cover a number of cases that would need a forward iterator in C++. Paul.

Paul Moore wrote:
My thought is this. iterable and iterator are separate and complementary concepts: a collection (concrete or virtual) that can be iterated over and the thing that does the iterating. For (great) programming convenience, Python iterators were made into a subcategory of an expanded 'iterable' category. In informal discourse, people sometimes, perhaps often, use 'iterable' in the narrower conceptual sense rather than the broader Python-implementation sense. Or they sort of mean it in the narrow sense but it does not matter because the statement also applies to iterators once they are made into iterables. To be clear in a Python context, when one wants to say something that does not apply to iterators, one can qualify 'iterable' to 'non-iterator iterable' or, less commonly at present, use the synonym 'reiterable'. I say synonym because any Python iterable object is reiterable as long as the information defining it is not altered, destroyed, or 'forgotten' by the process of iteration.
It is a solution to a conceptual and didactic problem that I offered for *possible* practical use. But I am not pushing it. Terry Jan Reedy

Terry Reedy wrote:
The issue is more frequent in Py3, which largely shifts from lists to iterators as the common sequence interchange format.
I don't think that's right. As I understand it, things like dict.items() now return *views*, which *are* reiterable (as long as the underlying data doesn't change). -- Greg
participants (7)
-
Georg Brandl
-
Greg Ewing
-
Mike Steder
-
Nick Coghlan
-
Paul Moore
-
Steven D'Aprano
-
Terry Reedy