On May 14, 2020, at 20:17, Stephen J. Turnbull wrote:
Andrew Barnert writes:
Students often want to know why this doesn’t work:

    with open("file") as f:
        for line in file:
            do_stuff(line)
        for line in file:
            do_other_stuff(line)
Sure. *Some* students do. I've never gotten that question from mine, though I do occasionally see
    with open("file") as f:
        for line in f:  # ;-)
            do_stuff(line)
    with open("file") as f:
        for line in f:
            do_other_stuff(line)
I don't know, maybe they asked the student next to them. :-)
Or they got it off StackOverflow or Python-list or Quora or wherever. Those resources really do occasionally work as intended, providing answers to people who search without them having to ask a duplicate question. :)
The answer is that files are iterators, while lists are… well, there is no word.
As Chris B said, sure there are words: File objects are *already* iterators, while lists are *not*. My question is, "why isn't that instructive?"
Well, it’s not _completely_ not instructive, it’s just not _sufficiently_ instructive. Language is more useful when the concepts it names carve up the world in the same way you usually think about it. Yes, it’s true that we can talk about “iterables that are not iterators”. But that doesn’t mean there’s no need for a word. We don’t technically need the word “liquid” because we could always talk about “compressibles that are not solid” (or “fluids that are not gas”); we don’t need the word “bird” because we could always talk about “diapsids that are not reptiles”; etc. Theoretically, English could express all the same propositions and questions and so on that it does today without those words. But practically, it would be harder to communicate with. And that’s why we have the words “bird” and “liquid”. And the reason we don’t have a word for all diapsids except birds and turtles is that we don’t need to communicate about that category. Natural languages get there naturally; jargon sometimes needs help.
We shouldn’t define everything up front, just the most important things. But this is one of the most important things. People need to understand this distinction very early on to use Python,
No, they don't. They neither understand, nor (to a large extent) do they *need* to.
ISTM that all we need to say is that
1. An *iterator* is a Python object whose only necessary function is to return an object when next is applied to it. Its purpose is to keep track of "next" for *for*. (It might do other useful things for the user, e.g., file objects.)
2. The *for* statement and the *next* builtin require an iterator object to work. Since for *always* needs an iterator object, it automatically converts the "in" object to an iterator implicitly. (Technical note: for the convenience of implementors of 'for', when iter is applied to an iterator, it always returns the iterator itself.)
I think this is more complicated than people need to know, or usually learn. People use for loops almost from the start, but many people get by with never calling next. All you need is the concept “thing that can be used in a for loop”, which we call “iterable”. Once you know that, everything else in Python that loops is the same as a for loop—the inputs to zip and enumerate are iterables, because they get looped over. “Iterable” is the fundamental concept. Yeah, it sucks that it has such a clumsy word, but at least it has a word. You don’t need the concept “iterator” here, much less need to know that looping uses iterables by calling iter() to get an iterator and then calling next() until StopIteration, until you get to the point of needing to read or write some code that iterates manually. Of course you will need to learn the concept “iterator” pretty soon anyway, but only because Python actually gives you iterators all over the place. In a language (like Swift) where zip and enumerate were views, files weren’t iterable at all, etc., you wouldn’t need the concept “iterator” until very late, but in Python it shows up early. But you still don’t need to learn about next(); that’s as much a technical detail as the fact that they return self from iter(). You want to know whether they can be used in for loops—and they can, because (unlike in Swift) iterators are iterable, and you already understand that.
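To make that concrete (a quick sketch; the names here are just for illustration): anything you can put in a for loop also works as an input to zip and enumerate, whether or not it happens to be an iterator, because they just loop over whatever you hand them.

```python
items = ["a", "b", "c"]          # a list: an iterable that is not an iterator
lines = iter(["x\n", "y\n"])     # an iterator, standing in for a file

# Both work in a for loop, and both work as inputs to zip and enumerate,
# because zip and enumerate just loop over whatever you hand them.
pairs = list(zip(items, lines))
numbered = list(enumerate(items))
print(pairs)     # [('a', 'x\n'), ('b', 'y\n')]
print(numbered)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```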
3. When a "generic" iterator "runs out", it's exhausted, it's truly done. It is no longer useful, and there's nothing you can do but throw it away. Generic iterators do not have a reset method. Specialized iterators may provide one, but most do not.
Yes, this is the next thing you need to know about iterators. But you also need to know that many iterables don’t get consumed in this way. Lists, ranges, dicts, etc. do _not_ run out when you use them in a for loop. There’s a wide range of things you use every day that can be looped over repeatedly. And they all act the same way—each time you loop over them, you get all of their contents, from start to finish. That isn’t part of the Iterable protocol, or the concept underneath it. It can’t be, because it’s not true for some common iterables, like all iterators. People try to guess at what that concept is, and that’s where they run into problems. Because:
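That difference is easy to show directly (a minimal sketch):

```python
xs = [1, 2, 3]
first_pass = [x for x in xs]
second_pass = [x for x in xs]    # a fresh loop gets everything again

it = iter(xs)                    # an explicit iterator over the same list
consumed = [x for x in it]
leftover = [x for x in it]       # already exhausted: nothing left
print(first_pass, second_pass)   # [1, 2, 3] [1, 2, 3]
print(consumed, leftover)        # [1, 2, 3] []
```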
5. Most Python objects are not iterators, but many can be converted. However, some Python objects are constructed as iterators because they want to be "lazy". Examples are files (so that a huge file can be processed line by line without reading the whole thing into memory) and "generators" which yield a new item each time they are called.
But AFAIK we *do* say that, and it doesn't get through.
I think many people do get this, and that’s exactly what leads to confusion. They think that “lazy” and “iterator” (or “consumed when you loop over it”) go together. But they don’t. If you learned that “some Python objects are constructed as iterators because they want to be lazy”, and you know ranges are lazy, you’re liable to think that ranges are consumed when you loop over them, and if you know the term “iterator”, you’ll apply it to ranges (as so many people do—even people writing blog posts and StackOverflow answers). And if you think of files as _not_ lazy—because, after all, the lines do exist in advance on disk—then you expect them to be reusable in for loops, just like lists and dicts. (If you think about socket.makefile() or open('/dev/random'), that would probably disabuse you of the notion, but how many novices are using those files?) You could explain this by further refining the concept of “lazy” to explain that files are lazy in the sense of processing, or heap usage, or something, not just ontological existence or whatever. But that’s pretty complicated. And it’s ultimately misleading, because it still gives people the wrong answer for ranges.
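Ranges are the clearest demonstration that laziness and consumption are separate axes (a sketch):

```python
r = range(5)
print(list(r), list(r))   # [0, 1, 2, 3, 4] both times: lazy, but not consumed

big = range(10**12)       # still lazy: nothing close to a trillion ints is stored
print(big[10**9])         # 1000000000, computed on demand by arithmetic
print(10**9 in big)       # True, and instant: membership is arithmetic, not a scan
```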
I can teach a child why a glass will break permanently when you hit it while a lake won’t by using the words “solid” and “liquid”.
Terrible example, since a glass is just a geologically slow liquid. ;-)
No, a glass is a solid. It doesn’t flow (except in the very loose sense that all solids do). And even if that factoid weren’t false, it would be a fact about physicists’ jargon, not about the everyday words. If I ask you to bring a fruit salad to the potluck and you show up with tomatoes, peas, peanuts, wheat grains, and eggplants but no strawberries, nobody is going to be impressed.
Back to the discussion: the child can touch both, and does so frequently (assuming you don't feed them from the dog's bowl and also bathe them regularly). They've seen glasses break, most likely, and splashed water.
And someone learning Python does get to touch both things here. They get lists, dicts, and ranges, and they get files, zips, and enumerate. Both categories come up pretty early in learning Python, just like both solids and liquids come up pretty early in learning to be human.
Iterators have one overriding purpose: to be fed to *for* statements, be exhausted, and then discarded. This is so important that it's done implicitly and in every single *for* statement. We have the necessary word, "iterator," but students don't have the necessary experience of "touching" the iterator that *for* actually iterates over instead of the list that is explicit in the *for* statement. That iterator is created implicitly and becomes garbage as soon as the *for* statement ends. And there's no way for the student to touch it: it doesn't have a name!
No, it’s iterables whose purpose is being fed to a for statement. Yes, iterators are what for statements use under the covers to deal with iterables, but you don’t need to learn that until well after you’ve learned that iterators are what you get from open and zip.
If you want to fix nomenclature, don't call them "files," don't call them "file objects," call them "file iterators". Then students have an everyday iterator they can touch. I'll guarantee that causes other problems, though, and gets a ton of resistance. Even from me. :-)
You don’t have to call them “file iterators”, you just have to have the word “iterator” lying around to teach them when they ask why they can’t loop over a file twice. Which we do. In the same way, you don’t need to call lists “list iterables”, you just need to have the word “iterable” lying around to teach them when they ask what other kinds of things can go in a for loop. (As either you or Christopher said, it’s not a great word, but that’s another problem.) And you don’t need to call lists “list collections”, you just need to have the word “collection” lying around to teach them when they ask why ranges and lists and dicts let you loop over their values over and over. And that’s the word we don’t have. Which is why people keep trying to use the word “sequence” when it isn’t appropriate (calling a dict a sequence is very misleading—and range/xrange had the same problem before 3.2), or talk about “laziness” when it’s the wrong concept (ranges are lazy), etc. And it’s why I used the word “collection” even though it’s also incorrect, and had to follow up later in this paragraph to clarify, because not all of these things are sized containers (and maybe even vice-versa?), but that’s what “collection” means in Python. Because we have a concept and we don’t have a word for it.
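For what it’s worth, the ABCs make the missing word concrete: there’s an ABC for Iterable, one for Iterator, and a Collection ABC, but nothing that picks out “can be looped over repeatedly” (a sketch):

```python
from collections.abc import Collection, Iterable, Iterator

xs = [1, 2, 3]
print(isinstance(xs, Iterable), isinstance(xs, Iterator))              # True False
print(isinstance(iter(xs), Iterable), isinstance(iter(xs), Iterator))  # True True

# Collection = Iterable + Sized + Container: close, but it names
# "sized container", not "can be looped over repeatedly".
print(isinstance(range(3), Collection))   # True
# And there is no ABC at all for "iterable but not an iterator".
```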
Yes, and defining terminology for the one distinction that almost always is relevant helps distinguish that distinction from the other ones that rarely come up. Most people (especially novices) don’t often need to think about the distinction between iterables that are sized and also containers vs. those that are not both sized and containers, so the word for that doesn’t buy us much. But the distinction between iterators and things-like-list-and-so-on comes up earlier, and a lot more often, so a word for that would buy us a lot more.
We have that word and distinction. A file object *is* an iterator. A list is *not* an iterator. *for* works *with* iterators internally, and *on* iterables through the magic of __iter__.
“Not an iterator” is not a word. Of course you _can_ talk about things that don’t have names by being circuitous, but it’s harder. In theory, you could build a language out of any set of categories that carve up the world, and build all of the rest by composition. We don’t need the word “bird” when we could say “diapsids that aren’t reptiles”, or “liquid” when we could say “compressed matter that isn’t solid” or “fluid that isn’t gas or plasma”. Such a language would technically be able to discuss all the same things as English—but it would make communication much harder. And thinking clearly, too—human brains work better when the categories picked out by language are a rough match for the categories they need to think about than when they aren’t. And in practice, people do need to think about “things that can be looped over repeatedly and give you their values over and over”, and having to say “iterables that are not iterators” may be technically sufficient, but practically it makes communication and thought harder. It means we have to be more verbose and less to the point, and people make silly mistakes like the one in the parent thread, and people make more serious mistakes like teaching others that ranges are iterators, and then having to speak circuitously makes it harder to explain their mistakes to them.
But you *don't* use seek(0) on files (which are not iterators, and in fact don't actually exist inside of Python, only names for them do). You use them on opened *file objects* which are iterators. A file object is a file, in the same way that a list object is a list and an int object is an int.
No, it's not the same: your level of abstraction is so high that you've lost sight of the iterable/iterator distinction. All of the latter objects own their own data in a way that a file object does not. All of the latter objects are different from their iterators (where such iterators exist), while the file object is not.
That really is the wrong distinction, both at the novice level and at the Python-ideas level. You’re talking about laziness again. And while (nearly) all iterators are lazy, not all lazy things are iterators. In what sense does a range own its data? It doesn’t store it anywhere; it creates it on demand by doing arithmetic on the things it actually does store. If you’re really careful you can sort of explain that one, but then in what sense does a dict_keys or a memoryview or an mmap “own” its data that a file doesn’t? And yet, they all work like lists.
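dict_keys is a good demonstration of that: it’s lazy, it doesn’t own its data, and it still works like a list in for loops (a sketch):

```python
d = {"a": 1, "b": 2}
keys = d.keys()                 # a lazy view, not an iterator
print(list(keys), list(keys))   # ['a', 'b'] both times: reusable

d["c"] = 3                      # the view doesn't own its data either:
print(list(keys))               # ['a', 'b', 'c'] -- it reflects the dict
```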
The fact that we use “file” ambiguously for a bunch of related but contradictory abstractions (a stream that you can read or write, a directory entry, the thing an inode points to, a document that an app is working on, …) makes it a bit more confusing, but unfortunately that ambiguity is forced on people before they even get to their first attempt at programming, so it’s probably too late for Python to help (or hurt).
Agreed. I would be much happier if we could discuss an example that is *not* iterating over files but *does* come up every day on StackOverflow. Maybe zips would work but I'm not sure the motivation comes together the way it does for files (why do zips want to be lazy? what are the compelling examples for zip of "restarting the iteration where you left off" with a new *for* statement?)
I think zips want to be lazy for exactly the same reason dict_items want to be lazy. People had real-life code that was wasting too much time or space building a list that was usually only going to be used for a single pass through a loop, so Python fixed that by making them lazy. But notice that one of them is an iterator and the other isn’t. So the distinction between the two isn’t about laziness. So why are zips lazy iterators instead of lazy views? I think it comes down to historical reasons and implementation simplicity. Designing a view for zip would be harder than for dict.items (see Swift for evidence) because its inputs are so much more general. A lot of tricky questions come up about both the API design and the implementation, that all have obvious answers for dict_items but not for zip. Meanwhile, zip was invented as itertools.izip, and itertools is… well, it’s right there in the name. And it was invented before Python had lots of other views to inspire it. So, it’s no surprise that it was an iterator. And even when 3.0 came along, it was a lot easier to say “let’s move izip, ifilter, and imap out of itertools and replace the old list-producing functions” than to design something entirely new, which, in the absence of a really compelling need for something entirely new, should have won out, and did.
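The both-lazy-but-different behavior is easy to see side by side (a sketch):

```python
d = {"a": 1, "b": 2}
items = d.items()           # lazy view: loop over it as often as you like
z = zip("ab", [1, 2])       # lazy iterator: one pass and it's done

view_first, view_second = list(items), list(items)
zip_first, zip_second = list(z), list(z)
print(view_first, view_second)  # both [('a', 1), ('b', 2)]
print(zip_first, zip_second)    # [('a', 1), ('b', 2)] []
```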
Lists, sets, ranges, dict_keys, etc. are not iterators. You can write `for x in xs:` over and over and get the values over and over. Because each time, you get a new iterator over their values.
You and I know that, because we know what an iterator is, and we know it's there because it has to be: *for* doesn't iterate anything but an iterator. But (except via a bytecode-level debugger) nobody has ever seen that iterator. You can use iter to get a similar iterator, of course, but it's not the same object that any for statement ever used. (Unless you explicitly created it with iter, but then you can re-run the for statement on it the way you do with a list.)
This is exactly why I wouldn’t explain it to a novice in terms of “for doesn’t iterate anything but an iterator”. Sure, you and I know that it does something nearly equivalent to calling iter() and then calls next() on the result until it receives a StopIteration, but that’s not why lists can be used in for loops; that’s just how Python does it. And in fact, if CPython had special-case opcodes for looping over old-style sequences or SequenceFast C sequences without ever creating the iterator, it wouldn’t change the visible behavior. In fact, under the covers, some C functions (like, IIRC, tuple.__new__) that accept any iterable do exactly that. It doesn’t change their observable behavior, so nobody needs to know. Of course when talking to you, or to Python-ideas, I can count on the fact that you know that iterators return self from iter(), and that “like a for loop” means “as if calling iter() and then calling next() repeatedly until an exception and swallowing the exception if it’s StopIteration”, but I don’t expect everyone who uses Python to know all of that.
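For reference, the "nearly equivalent" protocol I mean is roughly this (a sketch, not CPython's actual implementation, which may special-case some types):

```python
def for_loop(iterable, body):
    """Roughly what `for x in iterable: body(x)` does under the covers."""
    it = iter(iterable)      # a list hands out a fresh iterator here;
    while True:              # an iterator just returns itself
        try:
            x = next(it)
        except StopIteration:
            break
        body(x)

out = []
for_loop([1, 2, 3], out.append)
print(out)  # [1, 2, 3]
```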
Files, maps, zips, generators, etc. are not like that. They’re iterators. If you write `for x in xs:` twice, you get nothing the second time, because each time you’re using the same iterator, and you’ve already used it up. Because iter(xs) is xs when it’s a file or generator etc.
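The iter(xs) is xs identity is directly observable (a sketch, using io.StringIO as a stand-in for a real file):

```python
import io

def gen():
    yield 1

g = gen()
f = io.StringIO("a\nb\n")            # stand-in for an open file
print(iter(g) is g, iter(f) is f)    # True True: iterators return themselves

xs = [1, 2]
print(iter(xs) is xs)                # False: a list is not its own iterator
print(iter(xs) is iter(xs))          # False: you get a *new* one each time
```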
Genexps are iterators, but generators (in the sense of the product of a def that contains "yield") are not even iterable. Those are iterator factories.
The word “generator” is ambiguous. The type with the name “generator” that’s publicly available as “types.GeneratorType” and testable with inspect.isgenerator and that has the attributes like gi_frame that the docs say all generators have, those are generator iterators. And the things testable with .__code__.co_flags & CO_GENERATOR, those are generator functions. They’re both called “generator” so often that you have to be careful to say “generator iterator” or “generator function” when it’s not clear from context which one you mean, but I think it’s pretty clear from the context “generators are iterators” and “if you write for x in xs:” and so on which one I meant.
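The stdlib spellings make the two senses easy to tell apart (a sketch):

```python
import inspect
import types

def gen():      # a generator *function*: calling it builds generator iterators
    yield 1

g = gen()       # a generator *iterator*
print(inspect.isgeneratorfunction(gen), inspect.isgenerator(gen))  # True False
print(inspect.isgeneratorfunction(g), inspect.isgenerator(g))      # False True
print(isinstance(g, types.GeneratorType))                          # True
```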
The only representation of files in Python is file objects—the thing you get back from open (or socket.makefile or io.StringIO or whatever else)—and those are iterators.
The thought occurred to me, "What if that was a bad decision? Maybe in principle files shouldn't be iterators, but rather iterables with a real __iter__ that creates the iterator." I realized that I'd already answered my own question in part: I find it easy to imagine cases where I'd want to get some lines of input from a file as a higher-level unit, then stop and do some processing. The killer app for me is mbox files. Another plausible case is reading top-level Lisp expressions from a file (although that doesn't necessarily divide neatly into lines). I also found it surprisingly complicated to think about the consequences to the type of making that change.
I think there’s an easier way to see why this was a good decision: because files have positions. (Or, if you prefer, because files are streams, which implies that they have positions.) We don’t have a read_at(pos, size) method, we have a read(size) method that reads from where you left off. Seeking does exist, but it’s secondary—and it works by changing where the file thinks it left off. Once you think of files as things that know where they are, it makes more sense to wrap an iterator, rather than a reusable iterable, around them. You could argue that having a position was a bad idea in the first place, that Python shouldn’t have done it just because C stdio does it (and Unix kernels make it easy). Sure, that would mean we couldn’t use sockets and pipes as files and it would be weird to deal with special Unix files like /dev/random, but none of those things are exactly fundamental to novices. And we could even have two abstractions—a “stream” is what we call a file today, a “file” is a higher-level thing that you can randomly access, iterate repeatably, or ask for a stream from, and novices would only have to learn files rather than streams (until they have to do something like dealing with an mbox). But nearly every other language and platform in use today does the same thing as Python (and C and UNIX). If you know FILE*, NSFileHandle, or whatever the thing is called in bash, PHP, Ruby, C#, Go, Lisp, etc., a Python file is the exact same thing. And vice versa. And if you need to deal with native Win32 file handles via pywin32, they work pretty much the same way as the files you already know; you just have to know how to change the spelling of some of the functions. And so on. That’s worth a lot. (Plus, two abstractions is always more to learn than one.)
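The read/tell/seek dance shows what "files have positions" means in practice (a sketch, using io.StringIO as a stand-in for a real file):

```python
import io

f = io.StringIO("hello world")   # stand-in for an open file
head = f.read(5)                 # read() starts wherever the file left off...
pos = f.tell()                   # ...and advances the position as it goes
rest = f.read()
f.seek(0)                        # seeking works by *changing* that position
head_again = f.read(5)
print(head, pos, repr(rest), head_again)
```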
Going back to the documentation theme, maybe one way to approach explaining iterators is to start with the use case of files as (non-seekable) streams, show how 'for iteration' can be "restarted" where you left off in the file, and teach that "this is the canonical behavior of iterators; lists etc are *iterable* because 'for' automatically converts them to iterators "behind the scenes".
I still think this is getting it backward. Iterating lists is more fundamental than iterating files. Possibly even iterating ranges is. And you don’t have to understand that it works by converting them to iterators to understand it. And even if you do understand that, it doesn’t really solve the problem, because “convert to an iterator behind the scenes” doesn’t really tell you that you can do that repeatedly and get independent results. In most other cases where Python converts something behind the scenes, like adding 2 to a float or a Fraction, this doesn’t matter. Nobody cares whether each time you add 2 you get the same 2.0 or a different one, or whether each time you write the same string to a text file you get the same UTF-8 bytes or a new one. Iterators probably aren’t the _only_ exception to that, but I’m pretty sure they’re the first one many people run into. On the other hand, this would certainly get the notion of “files are streams” across to novices (as opposed to people coming from other languages) faster and more easily than we do today, which might help a lot of them. It might even turn out to solve the “why can’t I loop over this file twice” question for a lot of people in a different way, and that different way might be something you could build on to explain the difference between zip and range. “Like a stream” is much more accurate than “because it wants to be lazy”, and maybe easier to understand as well.
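The "restart where you left off" behavior mentioned above, the one that makes iterator-files useful for things like mbox processing, looks like this (a sketch, using io.StringIO as a stand-in for a real file):

```python
import io

f = io.StringIO("header\n---\nbody1\nbody2\n")   # stand-in for an open file
headers = []
for line in f:               # first pass: stop at the separator
    if line == "---\n":
        break
    headers.append(line)

body = [line for line in f]  # a second for statement picks up where we stopped
print(headers)  # ['header\n']
print(body)     # ['body1\n', 'body2\n']
```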