[Python-ideas] Introduce collections.Reiterable

Fri Sep 20 03:28:04 CEST 2013

I am going to answer three people in one response.
In no particular order...

On 9/19/2013 9:02 AM, Nick Coghlan wrote:
> So, my question is a genuine one. While, *in theory*, an object can
> define a stateful __iter__ method that (e.g.) only works the first
> time it is called, or returns a separate object that still stores its
> "current position" information on the original container, I simply
> can't think of a non-pathological case where "isinstance(obj,
> Iterable) and not isinstance(obj, Iterator)" would give the wrong
> answer.
> In theory, yes, an object could obviously pass that test and still not
> be Reiterable, but I'm interested in what's true in *practice*.

On 9/19/2013 6:26 AM, Antoine Pitrou wrote:
 >> A slight problem is that there is no guarantee that a non-iterator
 >> iterable is re-iterable.
 > Any useful examples?

On 9/19/2013 7:37 AM, Joshua Landau wrote:
 > On 19 September 2013 11:28, Terry Reedy <tjreedy at udel.edu> wrote:
 >> Not everything in that category is necessarily re-iterable.
 > I cannot think of a non-pathological case where it is not; if it is
 > not re-iterable it should be changed to an iterator if it isn't
 > already.

[I think 'pathological' is a bit 'heavy' as a synonym for 'poorly 
written' ;=]

 >> Or if it is serially reiterable, it may not be parallel iterable,
 >> as needed for nested loops.
 > What do you mean?

To back up a bit: When dev writes a function, dev is responsible for 
specifying acceptable inputs. Neither the language nor custom requires 
dev to test that inputs meet the specification. Looking before leaping 
may not always work. I believe this to be true when inputs are iterables.
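
As a rough sketch of why looking before leaping can mislead here: the class OneShot and the helper looks_reiterable below are hypothetical names made up for illustration. The object passes the isinstance test proposed in this thread, yet it can only be iterated once.

```python
from collections.abc import Iterable, Iterator

class OneShot:
    """Hypothetical iterable whose __iter__ returns the same shared iterator."""
    def __init__(self, items):
        self._it = iter(items)

    def __iter__(self):
        # Every call hands back the one underlying iterator.
        return self._it

def looks_reiterable(obj):
    # The look-before-you-leap test discussed in this thread.
    return isinstance(obj, Iterable) and not isinstance(obj, Iterator)

obj = OneShot([1, 2, 3])
passes_test = looks_reiterable(obj)   # True: has __iter__, lacks __next__
first = list(obj)                     # [1, 2, 3]
second = list(obj)                    # []: the shared iterator is exhausted
```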

When user calls a function, user is responsible to provide arguments 
that meet the specification and accept the consequences either way.

When dev specifies an 'iterable' argument, he is (should be) saying that 
the argument will be iterated at most once and probably will be iterated 
eventually. If user passes an iterator, user should (except possibly in 
rare cases) not use it otherwise.

The first problem, which impinges on both specification and reiteration, 
is that an iterable may be either finite, or not, or 'in between' 
depending on the hardware and user needs. I think we should take 'iterable' 
to mean 'finite iterable' unless dev explicitly relaxes that by saying 
'possibly infinite iterable'.  (To be clear, infinite iterables are 
extremely useful.)

An additional complication, including for reiteration, is that 
'practically' finite may be different for time and space. For instance, 
'for i in range(10000000000): pass # 10 billion iterations' would take 
about 5 minutes on my machine while list(range(10000000000)) would fail 
for lack of memory. 
(The opposite situation is possible, but less relevant to this issue.)
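
To illustrate the space side only (this sketch deliberately does not run the 10 billion iteration loop, and it assumes a 64-bit build): a range of that size is cheap to create and query because it never stores its items.

```python
big = range(10_000_000_000)

# None of these materialize the items; range computes answers arithmetically.
length = len(big)
last = big[-1]
middle_present = 5_000_000_000 in big

# By contrast, list(big) would try to allocate 10 billion references
# and is not attempted here.
```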

Currently, if dev needs to iterate an input more than once, the 
specification should say so. If the user wants to pass an iterator, the 
user can instead pass list(iter). The reason to have user rather than 
dev make this call is that user is in a better position than dev to know 
whether iter is effectively finite.
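
A minimal sketch of that division of labor, assuming a hypothetical function normalize whose documentation says its input is iterated twice:

```python
def normalize(numbers):
    """Spec: 'numbers' must be a finite re-iterable; it is iterated twice."""
    total = sum(numbers)                   # first pass
    return [n / total for n in numbers]    # second pass

# User ignores the spec and passes an iterator: the second pass sees nothing.
wrong = normalize(iter([1, 2, 5]))         # []

# User knows the iterator is finite and small, so user makes the list.
right = normalize(list(iter([1, 2, 5])))   # [0.125, 0.25, 0.625]
```

Note that the first call fails silently rather than raising, which is one more reason the requirement belongs in the documentation.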

Now to the varieties of reiteration:

A. Serial: iterate the input (typically to exhaustion) and then 
reiterate (typically to exhaustion). In the typical case, the iterable 
must be finite. Given finite iterator iter, list(iter) is probably more 
efficient than tee(iter). But let user decide if either is sensible.
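
The two options for serial reiteration might look like this (squares is a hypothetical generator function used only as a source of a finite iterator):

```python
from itertools import tee

def squares(n):
    for i in range(n):
        yield i * i

# Option 1: materialize once, then iterate the list as often as needed.
as_list = list(squares(5))
list_total = sum(as_list)
list_largest = max(as_list)

# Option 2: tee. Since 'a' is run to exhaustion before 'b' starts,
# tee has to buffer every item anyway.
a, b = tee(squares(5))
tee_total = sum(a)
tee_largest = max(b)
```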

B. Parallel: iterate the input with two iterators that march along more 
or less in parallel. The degenerate extreme 'for a,b in zip(iter,iter):' 
would be better written 'for a in iter: b = a'. If the two iterators are 
mostly in sync, then the second iterator is only really needed when they 
diverge. In any case, parallel iteration is best handled internally, 
invisible to the caller, with tee or two or more indexes. (Indexes into 
a concrete collection are nice because it is so easy to sync one to the 
other -- 'i = j' or 'j = i'.) While re does this with finite strings, 
the underlying iterable for such functions does not, in general, need to 
be finite.
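
A sketch of parallel iteration handled internally with tee (pairwise_diffs is a hypothetical example function). The caller passes any iterable, including an infinite iterator, and never sees the second iterator:

```python
from itertools import count, islice, tee

def pairwise_diffs(iterable):
    """Yield differences between consecutive items of iterable."""
    lag, lead = tee(iterable)
    next(lead, None)              # lead runs one item ahead of lag
    for prev, cur in zip(lag, lead):
        yield cur - prev

squares = (n * n for n in count())    # infinite iterator
first_five = list(islice(pairwise_diffs(squares), 5))   # [1, 3, 5, 7, 9]
```

Because the two iterators stay one item apart, tee only ever buffers a single item.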

C. Crossed: iterate different dimensions in 'crossed' fashion. "for i in 
row: for j in column". For this to involve reiteration, case one is 
square arrays iterated by index. But then it is not an issue, as that 
will be done with a reiterable range. Case two is with multiple iterator 
inputs, with cross products as one example:

def cross(itera, iterb):
    for a in itera:
        for b in iterb:  # iterb is re-iterated once per item of itera
            yield a, b

The doc should specify that itera and iterb must be independent 
iterables. Note that the outermost iterator does not have to be finite.
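
A sketch of what goes wrong when the requirement is not met (cross is restated so the block runs on its own). Passing an iterator as iterb silently truncates the result, while an infinite outer iterable is fine as long as the consumer stops:

```python
from itertools import count, islice

def cross(itera, iterb):
    for a in itera:
        for b in iterb:
            yield a, b

with_lists = list(cross([1, 2], ['x', 'y']))
# [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]

with_iterator = list(cross([1, 2], iter(['x', 'y'])))
# [(1, 'x'), (1, 'y')]: iterb is exhausted after the first outer item

infinite_outer = list(islice(cross(count(), 'ab'), 3))
# [(0, 'a'), (0, 'b'), (1, 'a')]
```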

Useful example and determinism: generator functions are callable but not 
iterable. For the simple iterate once situation, one calls and passes 
the resulting generator. For reiteration, the following may work:

class GenfIt:
    def __init__(self, genf, *args):
        self.genf = genf
        self.args = args
    def __iter__(self):
        # Call the generator function afresh for each iteration.
        return self.genf(*self.args)
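
A usage sketch, with the wrapper restated so that the block runs on its own (countdown is a hypothetical generator function). Each call to iter() gets a fresh generator, so the wrapper can be iterated repeatedly:

```python
class GenfIt:
    """Wrap a generator function and its arguments as a re-iterable."""
    def __init__(self, genf, *args):
        self.genf = genf
        self.args = args

    def __iter__(self):
        # A new generator object for every iteration.
        return self.genf(*self.args)

def countdown(n):
    while n > 0:
        yield n
        n -= 1

reit = GenfIt(countdown, 3)
first = list(reit)     # [3, 2, 1]
second = list(reit)    # [3, 2, 1] again
```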

However, another hidden assumption in this thread has been that 
non-iterator iterables are deterministic, in the sense that re-calling 
iter(it) returns an iterator that yields the same sequence of items 
before raising StopIteration. Some very useful iterator-producing 
functions do not do that (ones returning iterators based on 
pseudo-random or external inputs). So we need to add 'deterministic' to 
the notion of 'reiterable'. And that cannot be mechanically determined.
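
For instance (DiceRolls is a hypothetical class for illustration), the following is a well-behaved non-iterator iterable that can be iterated any number of times, yet successive iterations will almost certainly yield different items:

```python
import random

class DiceRolls:
    """Hypothetical iterable: each iter() call produces n fresh rolls."""
    def __init__(self, n, rng):
        self.n = n
        self.rng = rng

    def __iter__(self):
        return (self.rng.randint(1, 6) for _ in range(self.n))

rolls = DiceRolls(20, random.Random())
first = list(rolls)
second = list(rolls)   # same length, but not the same sequence in general
```

No isinstance test can distinguish this from a list of 20 numbers.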

(Other possible complications: a resource can only be accessed by one 
connection at a time. Or it limits the frequency of connections.)

In summary: A. There are multiple iterable and iteration use cases.  B. 
We cannot really get away from documenting the requirements for iterable 
inputs and keeping some responsibility for meeting them in the hands of 
callers.

-- 
Terry Jan Reedy