urllib2 - iteration over non-sequence
gagsl-py2 at yahoo.com.ar
Sun Jun 10 08:47:25 CEST 2007
On Sun, 10 Jun 2007 02:54:47 -0300, Erik Max Francis <max at alcyone.com> wrote:
> Gary Herron wrote:
>> Certainly there are cases where xreadlines or read(bytecount) are
>> reasonable, but only if the total page size is *very* large. But for
>> most web pages, you guys are just nit-picking (or showing off) to
>> suggest that the full read implemented by readlines is wasteful.
>> Moreover, the original problem was with sockets -- which don't have
>> xreadlines. That seems to be a method on regular file objects.
> There is absolutely no reason to read the entire file into memory (which
> is what you're doing) before processing it. This is a good example of
> the principle of there is one obvious right way to do it -- and it isn't
> to read the whole thing in first for no reason whatsoever other than to
> avoid an `x`.
The problem is (and you appear not to have noticed this) that the object
returned by urlopen does NOT have an xreadlines() method; and even if it
did, many pages don't contain any '\n', so using xreadlines would read
the whole page into memory anyway.
Python 2.2 (the version the OP is using) did include an xreadlines
module (now defunct), but in this case it is painfully slow -
perhaps because it reads the source one character at a time.
So the best way would be to use (as Paul Rubin already said):

for line in iter(lambda: f.read(4096), ''):
    print line
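To make the idiom concrete, here is a self-contained sketch of the same
iter(callable, sentinel) pattern, written for modern Python with an
io.BytesIO object standing in for the urlopen response (an assumption for
testability; on a byte stream the EOF sentinel is b'' rather than ''):

```python
import io

# Stand-in for the file-like object returned by urlopen;
# in real use this would be f = urllib2.urlopen(url).
f = io.BytesIO(b"x" * 10000)

chunks = []
# iter(callable, sentinel) calls f.read(4096) repeatedly and stops
# when the call returns the sentinel (empty bytes at EOF), so at most
# 4096 bytes are held per iteration instead of the whole page.
for chunk in iter(lambda: f.read(4096), b''):
    chunks.append(chunk)

# 10000 bytes arrive as 4096 + 4096 + 1808
assert len(chunks) == 3
assert sum(map(len, chunks)) == 10000
```

The same loop works on any object with a read(size) method, which is why
it applies to urlopen responses and sockets alike, neither of which has
xreadlines().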