[Python-bugs-list] [ python-Bugs-524804 ] breaking file iter loop leaves file in stale state

Fri, 08 Mar 2002 01:08:14 -0800

Bugs item #524804, was opened at 2002-03-02 16:44
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=524804&group_id=5470

Category: Python Library
Group: Python 2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Just van Rossum (jvr)
Assigned to: Guido van Rossum (gvanrossum)
>Summary: breaking file iter loop leaves file in stale state

Initial Comment:
Given a file created with this snippet:

  >>> f = open("tmp.txt", "w")
  >>> for i in range(10000):
  ...     f.write("%s\n" % i)
  ... 
  >>> f.close()

Iterating over a file multiple times has unexpected behavior:

  >>> f = open("tmp.txt")
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  0
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  1861
  >>> 

I expected the last output line to be 1 instead of 1861.

While I understand the cause (xreadlines being used by the file
iterator, it reads a big chunk ahead, causing the actual filepos to
be out of sync), this seems to be an undocumented gotcha. The docs
say this:

  [ ... ] Each iteration returns the same result as
  file.readline(), and iteration ends when the readline()
  method returns an empty string.

That is true within one for loop, but not when you break out of the
loop and start another one, which I think is a valid idiom.
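
A minimal sketch of the read-ahead effect described above, written in
Python 3 rather than the 2.2 of this report. ReadAheadLines is a
hypothetical stand-in for the xreadlines-based iterator, not the real
module, and the 8192-character chunk size is an assumption. Each call
to first_line() builds a fresh iterator, the way each "for line in f:"
did, so the lines buffered by the first loop are lost to the second:

```python
import io


class ReadAheadLines:
    """Hypothetical stand-in for the xreadlines-style file iterator."""

    def __init__(self, f, chunk_size=8192):  # chunk size is an assumption
        self.f = f
        self.chunk_size = chunk_size
        self.buffer = []  # private state, invisible to the file object

    def __iter__(self):
        return self

    def __next__(self):
        if not self.buffer:
            self._fill()
        if not self.buffer:
            raise StopIteration
        return self.buffer.pop(0)

    def _fill(self):
        # Pull whole lines from the file until roughly chunk_size
        # characters are buffered; the file position runs far ahead.
        size = 0
        while size < self.chunk_size:
            line = self.f.readline()
            if not line:
                break
            self.buffer.append(line)
            size += len(line)


def first_line(f):
    # A fresh iterator per loop: whatever the previous one buffered is gone.
    for line in ReadAheadLines(f):
        return line.strip()
    return None


f = io.StringIO("".join("%s\n" % i for i in range(10000)))
first = first_line(f)
second = first_line(f)
print(first, second)
```

In this sketch the second loop resumes wherever the first iterator's
read-ahead stopped, not at the line after the one that was consumed.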

Another example of breakage:

  f = open(...)
  for line in f:
      if somecondition(line):
          break
      ...

  data = f.read()  # read rest in one slurp

The fundamental problem IMO is that the file iterator stacks
*another* state on top of an already stateful object. In a sense a
file object is already an iterator. The two states get out of sync,
causing confusing semantics, to say the least. The current behavior
exposes an implementation detail that should be hidden.

I understand that speed is a major issue here, so a solution might
not be simple.

Here's a report from an actual user:
http://groups.google.com/groups?hl=en&selm=owen-0B3ECB.10234615022002%40nntp2.u.washington.edu
The rest of the thread suggests possible solutions.

Here's what I *think* should happen (but: I'm hardly familiar with
either the fileobject or the xreadlines innards): xreadlines should
be merged with the file object. The buffer that xreadlines implements
should be *the* buffer for the file object, and *all* read methods
should use *that* buffer and the corresponding filepos.

Maybe files should grow a .next() method, so iter(f) can return f
itself. .next() and .readline() are then 100% equivalent.
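
A rough sketch of that proposal, in Python 3 (where .next() is spelled
__next__). SingleBufferFile is a hypothetical wrapper, not an existing
class, and the chunk size is an assumption. It keeps one buffer that
readline(), read() and iteration all share, and iter(f) returns f
itself, so breaking out of a loop loses nothing:

```python
import io


class SingleBufferFile:
    """Hypothetical wrapper: one buffer and one position for all reads."""

    def __init__(self, raw, chunk_size=8192):  # chunk size is an assumption
        self.raw = raw
        self.chunk_size = chunk_size
        self.buf = ""  # the only read-ahead buffer

    def readline(self):
        # Refill the shared buffer until it holds a full line or EOF is hit.
        while "\n" not in self.buf:
            chunk = self.raw.read(self.chunk_size)
            if not chunk:
                break
            self.buf += chunk
        end = self.buf.find("\n") + 1
        if end == 0:  # no newline left: return whatever remains
            end = len(self.buf)
        line, self.buf = self.buf[:end], self.buf[end:]
        return line

    def read(self):
        # Hand back buffered data first, then the rest of the file.
        data, self.buf = self.buf + self.raw.read(), ""
        return data

    def __iter__(self):
        return self  # the file is its own iterator

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


f = SingleBufferFile(io.StringIO("".join("%s\n" % i for i in range(10000))))
for line in f:
    break                # consumes only the first line
second = next(f)         # same result as f.readline()
rest = f.read()          # read rest in one slurp
print(line.strip(), second.strip(), rest[:4].split())
```

The read-ahead still happens, so the speed argument is not given up;
it is simply no longer hidden from the other read methods.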

----------------------------------------------------------------------

>Comment By: Just van Rossum (jvr)
Date: 2002-03-08 10:08

Message:
Logged In: YES 
user_id=92689

At the cost of, what, sensible, predictable semantics?
- fast is better than slow
- but slow is better than unpredictable
Or something...

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-03-08 02:16

Message:
Logged In: YES 
user_id=31435

I'm sure Guido was aware of this.  Making the simplest-to-spell
idiom as fast as possible was a deliberate decision at the time.

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2002-03-08 02:06

Message:
Logged In: YES 
user_id=31392

If I understand the checkin message Guido wrote for 2.113, 
he didn't intend the current behavior.

> file_getiter(): make iter(file) be equivalent to file.xreadlines().
> This should be faster.
>
> This means:
>
> (1) "for line in file:" won't work if the xreadlines module can't be
>     imported.
>
> (2) The body of "for line in file:" shouldn't use the file directly;
>     the effects (e.g. of file.readline(), file.seek() or even
>     file.tell()) would be undefined because of the buffering that goes
>     on in the xreadlines module.

----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2002-03-04 07:47

Message:
Logged In: YES 
user_id=18139

Agreed on all points of fact.  Also +1 on fixing it
by making iter(f).next() and f.readline() equivalent,
one way or another.

...The easy way: make f.__iter__() call readline()
instead of being so aggressive.  (Better than nothing,
in my view.)

...The hard way (JvR's proposal): add a level of input
buffering on top of what the C runtime provides.
xreadlines() breaks user expectations precisely
because it does this *halfway*.  Doing it right would
not be such a maintenance burden, I think.  I'm willing
to write the patch, although others wiser in the ways
of diverse stdio implementations (<wink>) might want
to supervise.

...As it stands, iter(f) seems like a broken
optimization.  Which is to say: it's not "undocumented
surprising behavior"; it's a bug.
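
For illustration, "the easy way" above can be approximated from user
code with the two-argument form of iter(), which calls f.readline()
until it returns the empty string. This is only a sketch in Python 3;
the lines() helper is a hypothetical name, and no extra buffering is
layered on top of the file:

```python
import io

f = io.StringIO("".join("%s\n" % i for i in range(10000)))


def lines(f):
    # iter(callable, sentinel): call f.readline() until it returns "".
    return iter(f.readline, "")


for line in lines(f):
    first = line.strip()
    break

for line in lines(f):
    second = line.strip()
    break

rest = f.read()  # picks up exactly where the loops left off
print(first, second, rest[:2].strip())
```

Because every line comes straight from readline(), the iterator and
the file position cannot drift apart, at the cost of the speedup that
the read-ahead was meant to provide.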

----------------------------------------------------------------------
