[Tutor] f.readlines(size)
Peter Otten
__peter__ at web.de
Tue Jun 6 03:35:52 EDT 2017
Nancy Pham-Nguyen wrote:
> Hi,
Hi Nancy, the only justification for the readlines() method is to serve as a
trap to trick newbies into writing scripts that consume more memory than
necessary. While the size argument offers a way around that, there are still
next to no use cases for readlines.
Iterating over a file directly is a very common operation, and a lot of work
has gone into making it efficient. Use it whenever possible.
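As a minimal sketch of what direct iteration looks like, the function below scans a file one line at a time and never holds more than the current line in memory. The function name and the temporary sample file are made up for the illustration.

```python
import os
import tempfile


def longest_line_length(path):
    """Return the length of the longest line, reading one line at a time."""
    longest = 0
    with open(path) as f:
        for line in f:  # the file object yields lines lazily
            longest = max(longest, len(line.rstrip("\n")))
    return longest


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("short\na much longer line\nmid\n")
    sample_path = tmp.name

result = longest_line_length(sample_path)
os.remove(sample_path)
print(result)
```

The same loop written with readlines() would build the complete list of lines first, which is exactly the memory cost you want to avoid.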
To read groups of lines consider

    import itertools

    # last chunk may be shorter
    with open(FILENAME) as f:
        while True:
            chunk = list(itertools.islice(f, 3))
            if not chunk:
                break
            process_lines(chunk)
or

    # last chunk may be filled with None values
    with open(FILENAME) as f:
        for chunk in itertools.zip_longest(f, f, f):  # Py2: izip_longest
            process_lines(chunk)
In both cases you will get chunks of three lines, the only difference being
the handling of the last chunk.
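To make the difference in the last chunk concrete, here is a small sketch (Python 3) that runs both approaches over four made-up lines in groups of three. The helper names and the sample list are illustrative only; a plain list stands in for the open file.

```python
import itertools


def chunks_islice(lines, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk


def chunks_zip_longest(lines, size):
    """Yield tuples of exactly `size` items; the last is padded with None."""
    it = iter(lines)
    # The same iterator is repeated `size` times, so each tuple takes
    # `size` consecutive items from it.
    return itertools.zip_longest(*[it] * size)


sample = ["a\n", "b\n", "c\n", "d\n"]  # made-up stand-in for an open file
short_last = list(chunks_islice(sample, 3))
padded_last = list(chunks_zip_longest(sample, 3))
print(short_last)
print(padded_last)
```

With islice the final chunk holds only the one leftover line; with zip_longest it is a full-width tuple whose missing slots are None, so process_lines() has to be prepared to skip them.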
> I'm trying to understand the optional size argument in the file.readlines
> method. The help(file) shows:
>
>  |  readlines(...)
>  |      readlines([size]) -> list of strings, each a line from the file.
>  |
>  |      Call readline() repeatedly and return a list of the lines so read.
>  |      The optional size argument, if given, is an approximate bound on the
>  |      total number of bytes in the lines returned.
>
> From the documentation:
>
> f.readlines() returns a list containing all the lines of data in the file.
> If given an optional parameter sizehint, it reads that many bytes from the
> file and enough more to complete a line, and returns the lines from that.
> This is often used to allow efficient reading of a large file by lines, but
> without having to load the entire file in memory. Only complete lines will
> be returned.
>
> I wrote the function below to try it, thinking that it would print multiple
> times, 3 lines at a time, but it printed all in one shot, just like when I
> didn't specify the optional argument. Could someone explain what I've
> missed? See input file and output below.
>
> Thanks,
> Nancy
> def readLinesWithSize():
> # bufsize = 65536
> bufsize = 45
> with open('input.txt') as f: while True:
> # print len(f.readlines(bufsize)) # this will print 33
> print
> lines = f.readlines(bufsize) print lines
> if not lines: break for line in lines:
> pass readLinesWithSize() Output:
This seems to be messed up a little by a "helpful" email client. Therefore
I'll give my own:
$ cat readlines_demo.py
LINESIZE=32
with open("tmp.txt", "w") as f:
    for i in range(30):
        f.write("{:02} {}\n".format(i, "x"*(LINESIZE-4)))
BUFSIZE = LINESIZE*3-1
print("bufsize", BUFSIZE)
with open("tmp.txt", "r") as f:
    while True:
        chunk = f.readlines(BUFSIZE)
        if not chunk:
            break
        print(sum(map(len, chunk)), "bytes:", chunk)
$ python3 readlines_demo.py
bufsize 95
96 bytes: ['00 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '01
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '02 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['03 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '04
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '05 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['06 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '07
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '08 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
...
So in Python 3 this does what you expect: readlines() stops collecting more
lines once the total size of the lines read so far (bytes, or characters for
a text file) exceeds the hint you specified.
"""
readlines(...) method of _io.TextIOWrapper instance
Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more
lines will be read if the total size (in bytes/characters) of all
lines so far exceeds hint.
"""
In Python 2 the docstring is a little vague:
"""
The optional size argument, if given, is an *approximate* *bound* on the
total number of bytes in the lines returned.
"""
(emphasis mine), and it seems that small size values, which would defeat the
goal of making the operation efficient, are ignored:
$ python readlines_demo.py
('bufsize', 95)
(960, 'bytes:', ['00 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '01
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '28 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '29
...
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'])
Playing around a bit on my system, I found that the minimum value with an
effect seems to be about 2**13 (8192), but I haven't consulted the readlines
source code to verify.
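If you do want batches bounded by size rather than by line count, the read loop from the demo can be wrapped in a small generator. This is only a sketch for Python 3, where the hint is honoured as shown in the output above; the name line_batches and the temporary file are made up, and the data mirrors the demo (30 lines of 32 characters, hint 95).

```python
import os
import tempfile


def line_batches(f, hint):
    """Yield lists of complete lines, each list roughly `hint` characters."""
    while True:
        batch = f.readlines(hint)
        if not batch:
            break
        yield batch


# Hypothetical data mirroring the demo: 30 lines of 32 characters each.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(30):
        tmp.write("{:02} {}\n".format(i, "x" * 28))
    path = tmp.name

with open(path) as f:
    batch_sizes = [len(batch) for batch in line_batches(f, 95)]
os.remove(path)
print(batch_sizes)
```

Each batch contains whole lines only, so the caller never has to stitch a split line back together. Under Python 2 the same loop would likely return everything in one batch for such a small hint, as the second run above suggests.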