[Python-Dev] Re: Reiterability

Sun Oct 19 12:30:15 EDT 2003

> > A problem I have with making iterator cloning a standard option is
> > that this would pretty much require that all iterators for which
> > cloning can be implemented should implement clone().  That in turn
> > means that iterator implementors have to work harder (sometimes
> > cloning can be done cheaply, but it might require a different
> > refactoring of the iterator implementation).
> 
> Making iterator authors aware of their clients' possible need to clone
> doesn't sound bad to me.  There's no _compulsion_ to provide the
> functionality, but some "social pressure" to do it if a refactoring can
> afford it, well, why not?

Well, since it can't be done for the very important class of
generators, I think it's better to prepare the users of all iterators
for their non-reiterability.  It would surely be a shame if the social
pressure to provide cloning ended up making generators second-class
citizens!

> > I'd like to hear more about those cases, to see if they really need
> > cloning (:-) or can live with a fixed limited backup capability.
> 
> I have an iterator it whose items, after an arbitrary prefix terminated by 
> the first empty item, are supposed to be each 'yes' or 'no'.

This is a made-up toy example, right?  Does it correspond with
something you've had to do in real life?

> I need to process it with different functions depending if it has certain 
> proportions of 'yes'/'no' (and yet another function if it has any invalid 
> items) -- each of those functions needs to get the iterator from right
> after that 'first empty item'.
> 
> Today, I do:
> 
> def dispatchyesno(it, any_invalid, selective_processing):
>     # skip the prefix
>     for x in it:
>         if not x: break
>     # snapshot the rest
>     snap = list(it)
>     it = iter(snap)
>     # count and check
>     yeses = noes = 0
>     for x in it:
>         if x=='yes': yeses += 1
>         elif x=='no': noes += 1
>         else: return any_invalid(snap)
>     total = float(yeses+noes)
>     if not total: raise ValueError("sequence empty after prefix")
>     ratio = yeses / total
>     for threshold, function in selective_processing:
>         if ratio <= threshold: return function(snap)
>     raise ValueError("no function to deal with a ratio of %s" % ratio)
> 
> (yes, I could use bisect, but the number of items in selective_processing
> is generally quite low so I didn't bother).
> 
> Basically, I punt and "snapshot" by making a list out of what is left of
> my iterator after the prefix.  That may be the best I can do in some cases,
> but in others it's a waste.  (Oh well, at least infinite iterators are not a
> consideration here, since I do need to exhaust the iterator to get the
> ratio:-).  What I plan to do if this becomes a serious problem in the
> future is add something like an optional 'clone=None' argument so I
> can code:
>     if clone is None:
>         snap = list(it)
>         it = iter(snap)
>     else: snap = clone(it)
> instead of what I have hardwired now.  But, I _would_ like to just do, e.g.:
>     try: snap = it.clone()
>     except AttributeError:
>         snap = list(it)
>         it = iter(snap)
> using some standardized protocol for "easily clonable iterators" rather
> than requiring such awareness of the issue on the caller's part.
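
For reference, the try/except fallback quoted above can be written as a small helper. This is only a sketch: clone() is the hypothetical protocol under discussion, not an existing standard method, and Countdown and snapshot are names made up here for illustration.

```python
class Countdown:
    """Hypothetical iterator that happens to be cheap to clone."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

    def clone(self):
        # Cloning only needs to copy one integer of state.
        return Countdown(self.n)


def snapshot(it):
    """Return (snap, it): two independent views of the remaining items."""
    try:
        # Hypothetical protocol: the iterator knows how to copy itself.
        return it.clone(), it
    except AttributeError:
        # Fallback: materialize the rest in memory and iterate over that.
        snap = list(it)
        return snap, iter(snap)
```

A generator takes the fallback branch, since it has no clone() method, which is the case that worries me.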

Is this from a real app?  What it most reminds me of is parsing email
messages that can come either from a file or from a pipe; often you
want to scan the body to find the end of its MIME structure and then
go back and do things to the various MIME parts.  If you know it comes
from a real file, it's easy to save the file offsets for the parts as
you parse them; but when it's a pipe, that doesn't work.  In practice,
these days, the right thing to do is probably to save the data read
from a pipe to a temp file first, and then parse the temp file; or if
you insist on parsing it as it comes in, copy the data to a temp file
as you go and save file offsets in the temp file.
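
A minimal sketch of the copy-as-you-go approach, assuming a line-oriented binary stream. The names spool_with_offsets and read_part, and the is_boundary predicate, are invented for this illustration; real MIME parsing is considerably more involved.

```python
import tempfile


def spool_with_offsets(stream, is_boundary):
    """Copy a non-seekable binary stream to a temp file, line by line.

    Returns (tmp, offsets), where offsets[i] is the temp-file position
    at which part i starts.  is_boundary is a hypothetical predicate
    telling whether a line ends the current part.
    """
    tmp = tempfile.TemporaryFile()
    offsets = [0]
    for line in stream:
        tmp.write(line)
        if is_boundary(line):
            # The next part starts right after this boundary line.
            offsets.append(tmp.tell())
    return tmp, offsets


def read_part(tmp, offsets, i):
    """Return the bytes of part i, including its closing boundary line."""
    start = offsets[i]
    tmp.seek(start)
    if i + 1 < len(offsets):
        return tmp.read(offsets[i + 1] - start)
    return tmp.read()
```

Once the pipe is exhausted, any part can be revisited by seeking in the temp file, with no need to clone anything.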

But I'm not sure that abstracting this away all the way to an iterator
makes sense.  For one, the generic approach to cloning if the iterator
doesn't have __clone__ would be to make a memory copy, but in this app
a disk copy is desirable (I can invent something that overflows to
disk above a certain limit, but it's cumbersome, and you have cleanup
issues, and it needs parameterization since not everybody agrees on
when to spill to disk).  Another issue is that the application doesn't
require iterating over the clone and the original iterator
simultaneously, but a generic auto-cloner can't assume that; for
files, this would either mean that each clone must have its own file
descriptor (and dup() doesn't cut it because it shares the file
offset), or each clone must keep a file offset, but now you lose the
performance effect of a streaming buffer unless you code up something
extremely hairy with locks etc.
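
To illustrate the dup() point, here is a small sketch assuming a POSIX system. The function name offsets_after_read is made up for this example; it reads through one descriptor and reports where a dup()ed descriptor and a separately opened one now stand.

```python
import os
import tempfile


def offsets_after_read():
    """Read 4 bytes via one descriptor; report where the others stand."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        fd1 = os.open(f.name, os.O_RDONLY)
        fd2 = os.dup(fd1)                    # same open file, shared offset
        fd3 = os.open(f.name, os.O_RDONLY)   # separate open, own offset
        try:
            data = os.read(fd1, 4)
            # lseek(fd, 0, SEEK_CUR) just reports the current position.
            dup_pos = os.lseek(fd2, 0, os.SEEK_CUR)
            fresh_pos = os.lseek(fd3, 0, os.SEEK_CUR)
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
    return data, dup_pos, fresh_pos
```

The dup()ed descriptor moves along with the original, while the second open() keeps its own position, so independent clones need either one open() each or explicit offset bookkeeping.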

--Guido van Rossum (home page: http://www.python.org/~guido/)