[Tutor] beginner question
Steven D'Aprano
steve at pearwood.info
Tue Nov 1 17:38:23 CET 2011
Mayo Adams wrote:
> When writing a simple for loop like so:
>
> for x in f
>
> where f is the name of a file object, how does Python "know" to interpret
> the variable x as a line of text, rather than, say, an individual character
> in the file? Does it automatically treat text files as sequences of lines?
Nice question! But the answer is a little bit complicated. The short
answer is:
File objects themselves are programmed to iterate line by line rather
than character by character. That is a design choice made by the
developers of Python, and it could have been different, but this choice
was made because it is the most useful.
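
To see what that design choice means in practice, here is a small sketch. It uses io.StringIO as a stand-in for a real text file (an assumption, made only so the example runs without a file on disk). Looping over the object directly gives lines. Getting characters takes explicit work, here with the two-argument form of iter():

```python
import io

# io.StringIO stands in for a real text file (an assumption for this sketch).
f = io.StringIO("fee\nfi\n")

# Iterating over the file object yields one line at a time.
lines = [x for x in f]

# To get characters instead, we have to ask for them explicitly:
# iter(callable, sentinel) keeps calling f.read(1) until it returns ''.
f.seek(0)  # rewind to the start of the "file"
chars = list(iter(lambda: f.read(1), ''))

print(lines)
print(chars)
```
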
The long answer requires explaining how for-loops work. When you say
for x in THINGY: ...
Python first asks THINGY to convert itself into an iterator. It does that
by calling the special method THINGY.__iter__(), which is expected to
return an iterator object (which may or may not be THINGY itself). If
there is no __iter__ method, then Python falls back on an older sequence
protocol which isn't relevant to files. If that too fails, then Python
raises an error.
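
The three cases can be tried out with the built-in iter() function, which performs the same lookup a for-loop does. This is only a sketch: OldStyle is a made-up class, not something from the standard library, used to show the older sequence protocol.

```python
class OldStyle:
    """Hypothetical class with no __iter__ method, only __getitem__."""
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)  # signals "no more items"
        return index * 10

# Case 1: the object has __iter__, which returns an iterator object.
it = iter([1, 2, 3])
first = next(it)

# Case 2: no __iter__, so Python falls back on the sequence protocol,
# calling __getitem__ with 0, 1, 2, ... until IndexError is raised.
fallback = list(OldStyle())

# Case 3: neither protocol is available, so Python raises TypeError.
try:
    iter(42)
    error = None
except TypeError as e:
    error = e
```
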
So what's an iterator object? An iterator object must have a method
called "next" (in Python 2), or "__next__" (in Python 3), which returns
"the next item". The object is responsible for knowing what value to
return each time next() is called. Python doesn't need to know anything
about the internal details of the iterator; all it cares about is that
when it calls the iterator's next() or __next__() method, the next item
will be returned. All the "intelligence" is inside the object, not in Python.
When there are no more items left to return, next() should raise
StopIteration, which the for loop detects and treats as "loop is now
finished" rather than as an error.
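
Put together, a for-loop behaves roughly like the while-loop below. This is a sketch of the protocol rather than the interpreter's actual implementation, and loop_by_hand is a name invented for the example.

```python
def loop_by_hand(thingy):
    """Collect the items of thingy in the order a for-loop would visit them."""
    iterator = iter(thingy)      # ask thingy for an iterator
    results = []
    while True:
        try:
            x = next(iterator)   # ask the iterator for the next item
        except StopIteration:
            break                # not an error: the loop is simply finished
        results.append(x)        # this is where the loop body would run
    return results

print(loop_by_hand("abc"))
```
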
So, the end result of all this is that Python doesn't care what THINGY
is, so long as it obeys the protocol. So anyone can create new kinds of
data that can be iterated over. In the case of files, somebody has
already done that for you: files are built into Python.
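
As an illustration of creating your own iterable data, here is a minimal sketch. Countdown is a hypothetical class made up for this example; all it does is obey the protocol, and that is enough for a for-loop to accept it.

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self              # I am my own iterator.

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # no more items left
        value = self.current
        self.current -= 1
        return value

    next = __next__              # Python 2 spelling of the same method

for n in Countdown(3):
    print(n)
```
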
Built-in file objects, like you get from f = open("some file", "r"),
obey the iterator protocol. We can iterate over one by hand, doing exactly
what Python does in a for-loop, only less conveniently.
Suppose we have a file containing "fee fi fo fum" split over four lines.
Now let's iterate over it by hand. File objects are already iterators,
so in Python 3 they have their own __next__ method and there's no need
to call __iter__ first:
>>> f = open('temp.txt', 'r')
>>> f.__next__()
'fee\n'
>>> f.__next__()
'fi\n'
>>> f.__next__()
'fo\n'
>>> f.__next__()
'fum\n'
>>> f.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
So the file object itself keeps track of how much of the file has been
read, and the Python interpreter doesn't need to know anything about
files. It just needs to know that the file object is iterable. I already
know that a file object is its own iterator, so I took a short-cut and
called f.__next__() directly. But Python doesn't know that, so it performs
one extra step: it calls f.__iter__() to get an iterator object:
>>> f.__iter__()
<_io.TextIOWrapper name='temp.txt' encoding='UTF-8'>
In this case, that iterator object is f itself, and now the Python
interpreter goes on to call __next__() repeatedly.
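
You can check this for yourself in Python 3. The sketch below writes a small temporary file first (the tempfile handling is scaffolding, assumed only so the example is self-contained), then confirms that iter(f) hands back f itself and that a loop carries on from wherever the file has got to.

```python
import os
import tempfile

# Create a four-line file in a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('fee\nfi\nfo\nfum\n')
    path = tmp.name

with open(path, 'r') as f:
    same_object = iter(f) is f    # the file is its own iterator
    first = next(f)               # same as calling f.__next__()
    rest = [line for line in f]   # the loop continues after the first line

os.remove(path)                   # clean up the temporary file
```
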
File objects are actually written in C for speed, but if they were
written in pure Python, they might look something vaguely like this:
    class File(object):
        def __init__(self, name, mode='r'):
            self.name = name
            if mode == 'r':
                ...  # open the file in Read mode
            elif mode == 'w':
                ...  # open in Write mode
            else:
                # actually there are other modes too
                raise ValueError('bad mode')
        def __iter__(self):
            return self  # I am my own iterator.
        def read(self, n=1):
            # Read n characters. All the hard work is in here.
            ...
        def readline(self):
            # Read a line, up to and including linefeed.
            buffer = []
            c = self.read()
            buffer.append(c)
            while c != '' and c != '\n':
                c = self.read()  # Read one more character.
                buffer.append(c)
            return ''.join(buffer)
        def __next__(self):
            line = self.readline()
            if line == '':
                # End of File
                raise StopIteration
            else:
                return line
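
The sketch above leaves read() unimplemented, so it cannot be run as it stands. A runnable variation on the same idea, assuming we borrow read() from an existing object instead of writing it ourselves, might look like this. LineIterator is a hypothetical name, and io.StringIO again stands in for a real file.

```python
import io

class LineIterator:
    """Hypothetical wrapper: iterate by lines over anything with read(n)."""
    def __init__(self, source):
        self.source = source     # any object with a read(n) method

    def __iter__(self):
        return self              # I am my own iterator.

    def readline(self):
        # Read one character at a time, up to and including a linefeed,
        # or until read() returns '' at the end of the data.
        buffer = []
        while True:
            c = self.source.read(1)
            buffer.append(c)
            if c == '' or c == '\n':
                break
        return ''.join(buffer)

    def __next__(self):
        line = self.readline()
        if line == '':
            raise StopIteration  # end of file
        return line

result = list(LineIterator(io.StringIO('fee\nfi\nfo\nfum')))
print(result)
```

Note that the last line comes back without a trailing newline, because the data did not end with one.
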
--
Steven