[Python-ideas] Re: Documenting iterators vs. iterables [was: Adding slice Iterator ...]

May 17, 2020

      Andrew Barnert writes:
...
...
...
The answer is that files are iterators, while lists are… well,
there is no word.
As Chris B said, sure there are words:  File objects are *already*
iterators, while lists are *not*.  My question is, "why isn't that
instructive?"
Well, it’s not _completely_ not instructive, it’s just not
_sufficiently_ instructive.
Language is more useful when the concepts it names carve up the
world in the same way you usually think about it.
True.  But that doesn't mean we need names for everything.  In your
"phases of matter" example, there are two characteristics, fluidity
(which gases and liquids have, but solids don't) and compressibility
(which gases have, but neither solids nor liquids do).  Here the
tripartite vocabulary makes sense, since they're orthogonal, and (in
our modern world) all three concepts are everyday experience.
...
Yes, it’s true that we can talk about “iterables that are not
iterators”. But that doesn’t mean there’s no need for a word.
True, but that also doesn't mean there *is* need for a word.
...
We don’t technically need the word “liquid” because we could always
talk about “compressibles that are not solid” (or “fluids that are
not gas”)
True, but neither "compressibles" nor "fluids" "is a thing".  Instead,
in everyday language "fluid" is pretty much synonymous with "liquid",
and AFAIK there are no compressibles that aren't fluids, so
"compressible" is pretty much purely an adjective.  OTOH, it's useful
to pick out each phase of matter separately.

You haven't made an argument that it's useful to pick out "iterables
that aren't iterators" separately yet, except that you believe that a
word would help (which to me is evidence for the need, but not very
strong evidence).

The reason I'm quite unpersuaded is that there's also a concept of
marked vs unmarked in linguistics.  Marked concepts are explicitly
indicated; unmarked concepts require an explicit contrast with the
marked concept, or they get folded into the generic word, leaving some
ambiguity that gets resolved by context.  (This can get really
persnickety with no obvious rules even in the same domain.  For
example, with gender, "he" is unmarked, and you need to disambiguate
"male person" from "person of unknown gender" from context, at least
in traditional English grammar.  While "she" is marked.  By contrast,
"male" and "female" are both unambiguous.[1])

Now, it seems to me that we are only ever going to discuss iterators
in the context of iteration, which means our domain of discourse is
pretty much restricted to iterables.  (In the sense that there's
nothing left to discuss about iteration once you've classed an entity
as "not iterable".)  Given the way iterable and iterator are defined,
it seems perfectly reasonable to me that iterator would be marked,
non-iterator iterable left to its own devices, and the word "iterable"
disambiguated from context, or perhaps marked with some fairly clumsy
modifier.

So how can one explain "the problem with re-iterating files"?  Here's
how I would (now that I've thought more about it than I should ;-):

Student: OK, so we use 'for' to iterate over lists etc.  And it's cool
         that we can do "for line in file".  But how come if I need to
         do it twice, with lists I can just use a new 'for' statement,
         but with files nothing useful happens?
Teacher: That's a good question.  You know that "things we can use in
         a for statement" are called "iterables" right?
         Well, files are a special kind of iterable called
         "iterator", and you can "start them where you left off" with
         a new 'for' statement.
Student: But the 'for' statement runs out!  You don't want to restart
         in the middle!
Teacher: Exactly!  And that's why nothing useful happens when you use
         a second for statement on an already-open file.  But you can
         use 'break' to stop partway through.
Student: Huh?  What's that good for?
Teacher: [Gives relevant example: paragraph-wise processing in text
         files with empty line paragraph breaks, message-wise
         processing in mbox files, etc.]
Student: Well, OK.  But that's not what I expected or wanted.
Teacher: [Presses "play" on Rolling Stones tune cued up for this moment.
         Continues as voice-over.]
         True enough.  I wasn't there when they designed this
         interface to files, so I'm not sure of all the reasons, but I do
         find it useful for the kind of processing I described
         earlier.  Of course, you can get the effect you want by using
         'open' again.  It's a little annoying that *you* have to remember
         to do this.  Also, there is a way to reset files the way you
         want.  Just use the '.seek(0)' method on the file before the
         second 'for' statement.
Student: Hey, wait!  Suppose I wanted to "restart where I left off" in
         iterating over a list.  I guess that just doesn't work?
Teacher: [Wishes she had more students like this.]
         Another good question.  If you want to do that, you have to
         construct an iterator from the list: 'lit = iter(l)'.  Now
         iterate over 'lit', and you can break in the middle and
         restart with a new 'for' statement, just like with files.
         It's a little annoying that you have to remember ...
Student: [clobbers teacher with a handy copy of Python Essential Reference]
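
The behaviour the teacher describes can be sketched in a few lines.
This is only an illustration, not code from the thread: io.StringIO
stands in for an open text file (an assumption made so the snippet is
self-contained), and the names 'l' and 'lit' are taken from the
dialogue.

```python
import io

# io.StringIO stands in for an open text file here (an assumption, so the
# sketch needs no real file); like a file object, it is its own iterator.
f = io.StringIO("a\nb\nc\n")

first_pass = [line for line in f]    # consumes the stream
second_pass = [line for line in f]   # nothing left: already exhausted

f.seek(0)                            # rewind, as the teacher suggests
third_pass = [line for line in f]    # all three lines again

l = [1, 2, 3, 4, 5]
lit = iter(l)                        # explicit iterator over the list

head = []
for x in lit:
    if x == 3:
        break                        # stop partway through
    head.append(x)

# A new 'for' over the same iterator resumes after the item that
# triggered the break (that item has already been consumed).
tail = [x for x in lit]

print(second_pass, third_pass)
print(head, tail)
```

Note that the element on which the loop breaks is consumed, so the
second loop starts with the element after it.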

The point of the little dialogue is that although the word "iterator"
is used, the student only has to remember it until the end of any
sentence in which it's used.  I think the student's responses are
quite natural, and they don't mention "iterator".  I suspect this
student won't remember 'iter' but I bet she does remember '.seek(0)'.

On the other hand, what is there to explain *specifically* about
iterables that aren't iterators that explaining about iterables
doesn't do just as well?  I guess there's the inverse of the "why
doesn't it work with files?" question, but does that ever get asked?
Surely almost all students encounter iteration over sequences first,
and only later over iterators?
...
...
2.  The *for* statement and the *next* builtin require an iterator
  object to work.  Since for *always* needs an iterator object, it
  automatically converts the "in" object to an iterator implicitly.
  (Technical note: for the convenience of implementors of 'for',
  when iter is applied to an iterator, it always returns the
  iterator itself.)
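
A rough sketch of what point 2 says, written out by hand.  The helper
'for_each' is hypothetical and exists only for illustration; the real
'for' statement is implemented in the interpreter, not like this.

```python
def for_each(iterable, body):
    """Rough equivalent of `for item in iterable: body(item)`."""
    it = iter(iterable)          # the implicit conversion done by 'for'
    while True:
        try:
            item = next(it)      # 'for' advances the iterator like this
        except StopIteration:
            break                # iterator exhausted: the loop ends
        body(item)

seen = []
for_each([1, 2, 3], seen.append)

it = iter([1, 2, 3])
same = iter(it) is it            # iter() on an iterator returns that iterator
print(seen, same)
```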
...
I think this is more complicated than people need to know, or
usually learn. People use for loops almost from the start, but many
people get by with never calling next. All you need is the concept
“thing that can be used in a for loop”, which we call
“iterable”.
Conceded.  "Had I only more time, I would have written a much shorter
post."
...
“Iterable” is the fundamental concept.
We agree on this too.
...
Of course you will need to learn the concept “iterator” pretty soon
anyway, but only because Python actually gives you iterators all
over the place. [...] You want to know whether they can be used in
for loops
I think now you are over-thinking this.  Iterators *are* iterables.
You have one because somebody told you it's iterable, and you want to
use it in a 'for' loop.  You only need to know that it's an iterator
if you want to re-iterate from the beginning, rather than re-start
from where you left off.

"Iterator" is the marked case.  But the "marker" is that you find out
about it when it doesn't "do what I meant".
...
I think many people do get this, and that’s exactly what leads to
confusion. They think that “lazy” and “iterator” (or “consumed when
you loop over it”) go together. But they don’t.
I'll grant that my words admit such confusion, especially if people
are predisposed to it.  I think they are.  After all, none of your
"many people" have read my thoughts on the matter before this thread!
Just as there are times when LBYL is the appropriate programming
technique (even though EAFP is possible), sometimes people who don't
read the whole relevant manual section in advance are going to get
burned by their guesses and analogies (especially if they got them
from others of the same type).
...
...
Back to the discussion: the child can touch both, and does so
frequently (assuming you don't feed them from the dog's bowl and
also bathe them regularly).  They've seen glasses break, most
likely, and splashed water.
And someone learning Python does get to touch both things
here. They get lists, dicts, and ranges, and they get files, zips,
and enumerate. Both categories come up pretty early in learning
Python, just like both solids and liquids come up pretty early in
learning to be human.
No, they don't, in a sense I explained.  Until the student has a use
case where they need to restart (either where they left off or from
the beginning) they can't tell the difference because they just put
the whatever in a 'for' statement which works like magic -- and to
them it is pure magic, because they don't know what iterable or
iterator or __iter__ or iter or __next__ or next are.  They just know
you can use lists and some other things in a 'for' statement.  The
restart distinction may not come up for a long time.  I didn't really
have a use case for it, until one time I wanted to do something with
mbox files and I didn't like what the mailbox module does.  So I had
to roll my own.
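
For concreteness, here is a minimal sketch of the "restart where you
left off" style of processing, using blank-line-separated paragraphs
rather than mbox messages.  It is not the code referred to above; the
function name 'read_paragraph' and the sample text are assumptions made
for the example, and io.StringIO again stands in for an open file.

```python
import io

def read_paragraph(lines):
    """Consume lines from the iterator `lines` up to the next blank line.

    Returns the paragraph as a list of lines without trailing newlines.
    An empty list means the input is exhausted (this simple sketch also
    stops early if two blank lines occur in a row).
    """
    paragraph = []
    for line in lines:
        if not line.strip():
            break                 # blank line: the paragraph is complete
        paragraph.append(line.rstrip("\n"))
    return paragraph

text = io.StringIO("one\ntwo\n\nthree\n\nfour\nfive\n")
paragraphs = []
while True:
    p = read_paragraph(text)      # each call resumes where the last one stopped
    if not p:
        break
    paragraphs.append(p)

print(paragraphs)
```

This only works because the stream is an iterator: passing a list of
lines instead would start from the first line on every call.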
...
No, it’s iterables whose purpose is being fed to a for statement.
I disagree, both in the abstract (Sequences are iterable, but don't
necessarily have an __iter__, and so I don't see how you can support
your assertion that their purpose is to be fed to 'for') and in the
concrete (lots of iterables with __iter__ are instantiated and never
intended to be iterated, yet are useful).  By contrast, every iterator
has an __iter__, and the technical term for an iterator that is never
iterated is "garbage".
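
A small sketch of the abstract point: an object with __getitem__ but no
__iter__ can still be used in a 'for' statement.  The class 'Squares'
is hypothetical, invented for this illustration.

```python
from collections.abc import Iterable

class Squares:
    """Hypothetical class: defines __getitem__ but no __iter__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)   # tells iter() where the sequence ends
        return index * index

sq = Squares(4)
values = [v for v in sq]              # works: iter() falls back to __getitem__
has_iter = hasattr(sq, "__iter__")
abc_says_iterable = isinstance(sq, Iterable)   # the ABC only looks for __iter__

print(values, has_iter, abc_says_iterable)
```

One consequence is that isinstance(obj, Iterable) can report False for
an object that 'for' handles without complaint.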
...
Yes, iterators are what for statements use under the covers to deal
with iterables, but you don’t need to learn that until well after
you’ve learned that iterators are what you get from open and zip.
True enough, my bad.  I was confounding two documentation problems
there.  One is teaching new users, and the other is helping experts
get it exactly right.  I've mixed them up quite a bit, but my list of
5 points should be thought of as aimed at a concise but comprehensive
description rather than a tutorial.
...
You don’t have to call them “file iterators”, you just have to have
the word “iterator” lying around to teach them when they ask why
they can’t loop over a file twice. Which we do.
Eh, that's my argument. :-)
...
In the same way, you don’t need to call lists “list iterables”[.]
And there's no way that I would.  "Iterable" is an adjective.  The
usage "iterables" for the class of iterable objects is something of an
abuse.[2]  My point about files is that they're the thing I would
expect would be most folks' first unpleasant encounter with an
exhausted iterator object, and by naming them as "file iterators" you
might be able to induce a lot of "a ha!" moments.  You come around to
a related suggestion below.  I admit that the "file iterator"
suggestion is pretty implausible.
...
You just need to have the word “iterable” lying around to teach them
when they ask what other kinds of things can go in a for loop.
I don't think you meant to write that: when they ask that, you don't
say "iterables, of course", you say "tuples, sets, and perhaps
surprisingly dicts, as well as dict views, and many other things."
It's only when you or the student need a name for that whole class
that you bring up the term "iterable" (at least in its noun form).
But I don't think that comes up, at least on the student side, for
quite a while.  A good student might ask "what else is iterable?" but
"What else can I use in a 'for' statement?" is perfectly serviceable.
I suppose the teacher might find it painful to completely avoid the
term "iterable" (especially as an adjective, and "iterator", for that
matter), but I would solve that problem as in the dialog: just use
them in such a way that the student doesn't need to remember them.  I
think that's quite do-able, even natural.

I do not claim this leaves the student with a complete and
satisfactory understanding of the concept of iterator, merely that it
allows them to understand the difference between iterables that start
from where they left off and those that begin again at the beginning.
...
And you don’t need to call lists “list collections”, you just need
to have the word “collection” lying around to teach them when they
ask why ranges and lists and dicts let you loop over their values
over and over.
Have you ever been asked that, outside of the context of explaining
why files, zips, etc. don't allow re-iteration from the start?  Has
anyone come to you puzzled because the second loop over a list did
useful work?
...
...
We have that word and distinction.  A file object *is* an
iterator.  A list is *not* an iterator.  *for* works *with*
iterators internally, and *on* iterables through the magic of
__iter__.
“Not an iterator” is not a word. Of course you _can_ talk about
things that don’t have names by being circuitous, but it’s harder.
Or you can not talk about them at all.  This is very frustrating,
because I agree with everything you say as a general principle, but
your concrete discussion never refers to iterators or iterables.  It's
always an analogy to birds and reptiles and plasmas and liquids.

I think that analogy breaks down because I doubt new programmers get
confused by the fact that they can re-iterate over lists.  Like, not
ever.  I'd even bet that students who try breaking out, then
restarting where they left off, and have it fail by restarting from
the beginning, are disappointed but not shocked.  So when do you
*need* to talk about non-iterator iterables?  Outside of threads like
this one?
...
And in practice, people do need to think about “things that can be
looped over repeatedly and give you their values over and over”,
and having to say “iterables that are not iterators” may be
technically sufficient, but practically it makes communication and
thought harder.
Or you can just treat "things that can be looped over repeatedly and
give you their values over and over" as the unmarked case of "iterable",
and speak of "iterators" when you need to distinguish the marked
case.[3]  Use of "marking" is something we do all the time.  I can't
say for sure that it would work here, but nothing you've written yet
convinces me it wouldn't.
...
It means we have to be more verbose and less to the point,
It doesn't mean we *have* to be more verbose, in principle.  "Marking"
works fine in natural language, just as anaphoric "it" does.  I may be
missing something, but you need to be more concrete about what the
need for this word (yet to be named) is.
...
and people make silly mistakes like the one in the parent thread,
and people make more serious mistakes like teaching others that
ranges are iterators,
Indeed they do.  I don't think that has as much to do with people not
having a word for iterables that aren't iterators as it does with them
not understanding what an iterator is.  Just because you have a word,
say "nandaro", for iterables that aren't iterators doesn't mean that
otherwise well-informed people will correctly classify ranges as
nandaro rather than incorrectly as iterators.
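
The range classification can be checked directly with the ABCs in
collections.abc.  A brief sketch, with zip included for contrast:

```python
from collections.abc import Iterable, Iterator

r = range(3)
z = zip("ab", "cd")

range_is_iterable = isinstance(r, Iterable)
range_is_iterator = isinstance(r, Iterator)   # a range is not an iterator
zip_is_iterator = isinstance(z, Iterator)

range_twice = (list(r), list(r))   # same values on both passes
zip_twice = (list(z), list(z))     # second pass is empty: zip was consumed

print(range_is_iterable, range_is_iterator, zip_is_iterator)
print(range_twice, zip_twice)
```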

As far as I can tell, most of the rest of your post addresses an
argument that I'm not making, and I don't know how to do it better, so
I'm just going to let it rest there.

As mentioned above, this captures a good bit of what I'm trying to get
at:
...
On the other hand, this would certainly get the notion of “files
are streams” across to novices (as opposed to people coming from
other languages) faster and more easily than we do today, which
might help a lot of them. It might even turn out to solve the “why
can’t I loop over this file twice” question for a lot of people in
a different way, and that different way might be something you
could build on to explain the difference between zip and
range. “Like a stream” is much more accurate than “because it wants
to be lazy”, and maybe easier to understand as well.
Footnotes: 
[1]  Or maybe "marked" doesn't apply here because those words are on
equal footing -- I'm not a linguist, I've just heard the concept
discussed by real linguists.

[2]  Linguists have a technical term for this kind of "abuse" but I
don't remember it.

[3]  I recognize that you can create objects that break this
dichotomy.  I doubt they're important enough to impede discussion for
lack of the word for "non-iterator iterables".  Again, concrete
examples would really help.

Stephen J. Turnbull