[Tutor] f.readlines(size)

Tue Jun 6 03:35:52 EDT 2017

Nancy Pham-Nguyen wrote:

> Hi,

Hi Nancy, the only justification for the readlines() method is to serve as a 
trap to trick newbies into writing scripts that consume more memory than 
necessary. While the size argument offers a way around that, there are still 
next to no use cases for readlines.

Iterating over a file directly is a very common operation, and a lot of work 
has gone into making it efficient. Use it whenever possible.
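
As a minimal sketch of direct iteration (the count_lines helper and the 
temporary sample file are made up for illustration), the loop below walks 
through a file without ever building a list of all its lines:

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file object directly."""
    count = 0
    with open(path) as f:
        for line in f:  # the file object yields one line per iteration
            count += 1
    return count


# Hypothetical sample file, created only so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")

line_count = count_lines(tmp.name)
print(line_count)
os.remove(tmp.name)
```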

To read groups of lines, consider

import itertools

# last chunk may be shorter
with open(FILENAME) as f:
    while True:
        chunk = list(itertools.islice(f, 3))
        if not chunk:
            break
        process_lines(chunk)

or

import itertools

# last chunk may be padded with None values
with open(FILENAME) as f:
    for chunk in itertools.zip_longest(f, f, f): # Py2: izip_longest
        process_lines(chunk)

In both cases you will get chunks of three lines, the only difference being 
the handling of the last chunk.
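
To make the difference concrete, here is a small Python 3 sketch that runs 
both variants over the same seven-line input. The sample text is invented, 
and io.StringIO merely stands in for a real file:

```python
import io
import itertools

# Hypothetical input: seven lines, so the last chunk is incomplete.
text = "".join("line {}\n".format(i) for i in range(7))

# Variant 1: islice() simply yields a shorter final chunk.
f = io.StringIO(text)
sliced = []
while True:
    chunk = list(itertools.islice(f, 3))
    if not chunk:
        break
    sliced.append(chunk)

# Variant 2: zip_longest() pads the final chunk with None.
f = io.StringIO(text)
padded = list(itertools.zip_longest(f, f, f))

print(sliced[-1])  # ['line 6\n']
print(padded[-1])  # ('line 6\n', None, None)
```

With the second variant process_lines() has to be prepared to skip the None 
padding; with the first it has to cope with chunks of varying length.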

> I'm trying to understand the optional size argument in the file.readlines
> method. The help(file) shows:
>
> |  readlines(...)
> |      readlines([size]) -> list of strings, each a line from the file.
> |
> |      Call readline() repeatedly and return a list of the lines so read.
> |      The optional size argument, if given, is an approximate bound on the
> |      total number of bytes in the lines returned.
>
> From the documentation:
>
> f.readlines() returns a list containing all the lines of data in the
> file. If given an optional parameter sizehint, it reads that many bytes
> from the file and enough more to complete a line, and returns the lines
> from that. This is often used to allow efficient reading of a large file
> by lines, but without having to load the entire file in memory. Only
> complete lines will be returned.
>
> I wrote the function below to try it, thinking that it would print
> multiple times, 3 lines at a time, but it printed all in one shot, just
> like when I didn't specify the optional argument. Could someone explain
> what I've missed? See input file and output below.
>
> Thanks,
> Nancy

> def readLinesWithSize():
>     # bufsize = 65536
>     bufsize = 45      
>     with open('input.txt') as f:         while True:        
>         # print len(f.readlines(bufsize))   # this will print 33           
> print             
> lines = f.readlines(bufsize)             print lines    
>         if not lines:                 break             for line in lines:
>                 pass      readLinesWithSize() Output:

This seems to be messed up a little by a "helpful" email client. Therefore 
I'll give my own demo:

$ cat readlines_demo.py
LINESIZE=32
with open("tmp.txt", "w") as f:
    for i in range(30):
        f.write("{:02} {}\n".format(i, "x"*(LINESIZE-4)))

# one less than three full lines, so the third line exceeds the hint
BUFSIZE = LINESIZE*3-1
print("bufsize", BUFSIZE)

with open("tmp.txt", "r") as f:
    while True:
        chunk = f.readlines(BUFSIZE)
        if not chunk:
            break
        print(sum(map(len, chunk)), "bytes:", chunk)
$ python3 readlines_demo.py
bufsize 95
96 bytes: ['00 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '01 
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '02 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['03 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '04 
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '05 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['06 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '07 
xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '08 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
...

So in Python 3 this does what you expect: readlines() stops collecting more 
lines once the total size of the lines read so far exceeds the hint.
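
If you do want to loop over such batches in Python 3, the while/break 
pattern can be folded into a generator. This is only a sketch: read_batches 
is a made-up helper name, and the sample file is generated on the fly with 
30 lines of 32 characters each, like the demo data:

```python
import os
import tempfile


def read_batches(path, hint):
    """Yield lists of lines, each obtained from one readlines(hint) call."""
    with open(path) as f:
        # iter(callable, sentinel) stops once readlines() returns [].
        for batch in iter(lambda: f.readlines(hint), []):
            yield batch


# Hypothetical sample data: 30 lines of 32 characters each.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(30):
        tmp.write("{:02} {}\n".format(i, "x" * 28))

batch_sizes = [len(batch) for batch in read_batches(tmp.name, 95)]
print(batch_sizes)
os.remove(tmp.name)
```

The batch length follows from the hint and the line lengths, so for a fixed 
number of lines per chunk itertools.islice() remains the more direct tool.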

"""
readlines(...) method of _io.TextIOWrapper instance
    Return a list of lines from the stream.

    hint can be specified to control the number of lines read: no more
    lines will be read if the total size (in bytes/characters) of all
    lines so far exceeds hint.
"""

In Python 2 the docstring is a little vague:

"""
The optional size argument, if given, is an *approximate* *bound* on the
total number of bytes in the lines returned.
"""

(emphasis mine) and it seems that small size values, which defeat the goal of 
making the operation efficient, are ignored:

$ python readlines_demo.py
('bufsize', 95)
(960, 'bytes:', ['00 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '01 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n',
...
'28 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '29 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'])

Playing around a bit on my system, the minimum value with an effect seems to 
be about 2**13 (8192), but I haven't consulted the readlines source code to 
verify.