[Tutor] beginner question

Tue Nov 1 17:38:23 CET 2011

Mayo Adams wrote:
> When writing a simple for loop like so:
> 
>      for x in f
> 
> where f is the name of a file object, how does Python "know" to interpret
> the variable x as a line of text, rather than, say, an individual character
> in the file? Does it automatically treat text files as sequences of lines?

Nice question! But the answer is a little bit complicated. The short 
answer is:

File objects themselves are programmed to iterate line by line rather 
than character by character. That is a design choice made by the 
developers of Python, and it could have been different, but this choice 
was made because it is the most useful.
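
To make the difference concrete, here is a minimal sketch. It uses 
io.StringIO as a stand-in for an open text file (my assumption, so that 
the example needs no file on disk). Iterating over the file-like object 
gives lines, while iterating over the plain string gives characters:

```python
import io

text = "fee\nfi\n"

# A StringIO object behaves like an open text file.
f = io.StringIO(text)

lines = [x for x in f]     # iterating the file-like object: one line each
chars = [x for x in text]  # iterating the string itself: one character each

print(lines)
print(chars)
```

Same data, two different objects, two different ideas of what "the next 
item" means.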

The long answer requires explaining how for-loops work. When you say

     for x in THINGY: ...

Python first asks THINGY to convert itself into an iterator. It does that 
by calling the special method THINGY.__iter__(), which is expected to 
return an iterator object (which may or may not be THINGY itself). If 
there is no __iter__ method, then Python falls back on an older sequence 
protocol which isn't relevant to files. If that too fails, then Python 
raises an error.
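
As a rough illustration of the two fallback cases, here is a sketch using 
two made-up classes (Squares and NotIterable are hypothetical names of my 
own). The first has no __iter__ but supports indexing from 0 upwards, so 
the older sequence protocol applies. The second supports neither, so the 
for-loop fails:

```python
class Squares:
    """Hypothetical class with __getitem__ but no __iter__."""
    def __getitem__(self, index):
        if index >= 4:
            # IndexError tells the for-loop there are no more items.
            raise IndexError(index)
        return index * index


class NotIterable:
    """Hypothetical class with neither __iter__ nor __getitem__."""
    pass


# Works through the older sequence protocol: indexes 0, 1, 2, 3.
old_style = [x for x in Squares()]
print(old_style)

# Fails: there is nothing for the for-loop to call.
try:
    for x in NotIterable():
        pass
except TypeError as err:
    message = str(err)
    print(message)
```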

So what's an iterator object? An iterator object must have a method 
called "next" (in Python 2), or "__next__" (in Python 3), which returns 
"the next item". The object is responsible for knowing what value to 
return each time next() is called. Python doesn't need to know anything 
about the internal details of the iterator; all it cares about is that 
when it calls the iterator's next() or __next__() method, the next item 
will be returned. All the "intelligence" is inside the object, not in Python.

When there are no more items left to return, next() should raise 
StopIteration, which the for loop detects and treats as "loop is now 
finished" rather than as an error.
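
Putting those steps together, a for-loop behaves roughly like the 
while-loop below. This is only a sketch of the idea, not how the 
interpreter is actually written; manual_for is a hypothetical helper name, 
and it uses the built-in functions iter() and next(), which call the 
special methods for you:

```python
def manual_for(thingy):
    """Collect the items of thingy in the order a for-loop would visit them."""
    items = []
    iterator = iter(thingy)        # asks thingy for an iterator
    while True:
        try:
            item = next(iterator)  # asks the iterator for the next item
        except StopIteration:
            break                  # "loop is now finished", not an error
        items.append(item)         # stands in for the body of the loop
    return items


print(manual_for([1, 2, 3]))
print(manual_for("ab"))
```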

So, the end result of all this is that Python doesn't care what THINGY 
is, so long as it obeys the protocol. That means anyone can create new kinds of 
data that can be iterated over. In the case of files, somebody has 
already done that for you: files are built into Python.
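
For example, here is a small class of your own that a for-loop can use. 
Countdown is a hypothetical name chosen for illustration, and the code is 
written for Python 3 (in Python 2 the method would be called next instead 
of __next__):

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self  # I am my own iterator.

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # nothing left to return
        value = self.current
        self.current -= 1
        return value


result = [n for n in Countdown(3)]
print(result)
```

The for-loop knows nothing about counting down. It only calls the methods 
and stops when it sees StopIteration.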

Built-in file objects, like the one you get from f = open("some file", "r"), 
obey the iterator protocol. We can step through one by hand, doing exactly 
what Python does in a for-loop, only less conveniently.

Suppose we have a file containing "fee fi fo fum" split over four lines. 
Now let's iterate over it by hand. File objects are already iterators, 
so in Python 3 they have their own __next__ method and there's no need 
to call __iter__ first:

 >>> f = open('temp.txt', 'r')
 >>> f.__next__()
'fee\n'
 >>> f.__next__()
'fi\n'
 >>> f.__next__()
'fo\n'
 >>> f.__next__()
'fum\n'
 >>> f.__next__()
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
StopIteration

So the file object itself keeps track of how much of the file has been 
read, and the Python interpreter doesn't need to know anything about 
files. It just needs to know that the file object is iterable. I already 
know that a file is its own iterator, so I took a short-cut and called 
f.__next__() directly. Python doesn't know that in advance, so it performs 
one extra step: it calls f.__iter__() to get an iterator object:

 >>> f.__iter__()
<_io.TextIOWrapper name='temp.txt' encoding='UTF-8'>

In this case, that iterator object is f itself, and now the Python 
interpreter goes on to call __next__() repeatedly.
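
If you want to check this for yourself without typing at the prompt, here 
is a sketch as a script. It writes a throwaway file with the tempfile 
module (my choice, so the example cleans up after itself) and then 
confirms that the iterator returned by iter(f) is f itself:

```python
import os
import tempfile

# Create a small throwaway file containing four lines.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('fee\nfi\nfo\nfum\n')
    path = tmp.name

with open(path, 'r') as f:
    same_object = iter(f) is f       # the file is its own iterator
    first = next(f)                  # take one line by hand
    rest = [line for line in f]      # the for-loop carries on from there

os.remove(path)

print(same_object)
print(first)
print(rest)
```

Notice that the loop picks up where next() left off, because the file 
object is the one keeping track of the position.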

File objects are actually written in C for speed, but if they were 
written in pure Python, they might look something vaguely like this:

class File(object):
     def __init__(self, name, mode='r'):
         self.name = name
         if mode == 'r':
             ... # open the file in Read mode
         elif mode == 'w':
             ... # open in Write mode
         else:
             # actually there are other modes too
             raise ValueError('bad mode')

     def __iter__(self):
         return self  # I am my own iterator.

     def read(self, n=1):
         # Read n characters. All the hard work is in here.
         ...

     def readline(self):
         # Read a line, up to and including linefeed.
         buffer = []
         c = self.read()
         buffer.append(c)
         while c != '' and c != '\n':
             c = self.read()  # Read one more character.
             buffer.append(c)
         return ''.join(buffer)

     def __next__(self):
         line = self.readline()
         if line == '':
             # End of File
             raise StopIteration
         else:
             return line
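
If you would like something you can actually run, here is a variation on 
the same sketch that reads from a string held in memory instead of a real 
file. StringFile is a hypothetical name of my own, and this is only meant 
to show the protocol at work, not how real file objects are implemented:

```python
class StringFile:
    """Hypothetical stand-in for File that reads from a string, not disk."""
    def __init__(self, text):
        self.text = text
        self.pos = 0  # how far into the text we have read

    def __iter__(self):
        return self  # I am my own iterator.

    def read(self, n=1):
        # Return up to n characters, or '' once the text is used up.
        chunk = self.text[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk

    def readline(self):
        # Read a line, up to and including the linefeed.
        buffer = []
        c = self.read()
        buffer.append(c)
        while c != '' and c != '\n':
            c = self.read()
            buffer.append(c)
        return ''.join(buffer)

    def __next__(self):
        line = self.readline()
        if line == '':
            raise StopIteration  # end of "file"
        return line


string_lines = list(StringFile('fee\nfi\nfo\nfum'))
print(string_lines)
```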

-- 
Steven