[New-bugs-announce] [issue37036] Iterating a text file by line should not implicitly disable tell

Josh Rosenberg report at bugs.python.org
Fri May 24 12:08:32 EDT 2019


New submission from Josh Rosenberg <shadowranger+python at gmail.com>:

TextIOWrapper explicitly sets the `telling` flag to 0 when .__next__ ( textiowrapper_iternext ) is called ( https://github.com/python/cpython/blob/3.7/Modules/_io/textio.c#L2974 ), e.g. during standard for loops over the file of this form, trying to call tell raises an exception:

    with open(filename) as myfile:
        for line in myfile:
            myfile.tell()

which raises:

    OSError: telling position disabled by next() call

while the effectively equivalent:

    with open(filename) as myfile:
        for line in iter(myfile.readline, ''):
            myfile.tell()

works fine.

The implementation of __next__ and readline is almost identical (__next__ is calling readline and handling the EOF sentinel per the iterator protocol, that's all). Given they're implemented identically, I see no reason aside from nannying (discouraging slow operations like seek/tell during iteration by forbidding them on the most convenient means of iterating) to forbid tell after beginning iteration, but not after readline. Given the general Python philosophy of "we're all adults here", I don't see nannying as a good reason, which leaves the performance benefit of avoiding snapshotting as the only compelling reason to do this.

But the performance benefit is trivial; in local tests, the savings from avoiding that work is barely noticeable, per ipython microbenchmarks (on 3.7.2 Linux x86-64):

    >>> %%timeit -r5 from collections import deque; f = open('American-english.txt'); consume = deque(maxlen=0).extend
    ... f.seek(0)
    ... consume(iter(f.readline, ''))
    ...
    ...
    15.8 ms ± 38.4 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

    >>> %%timeit -r5 from collections import deque; f = open('American-english.txt'); consume = deque(maxlen=0).extend
    ... f.seek(0)
    ... next(f)  # Triggers optimization for all future read_chunk calls
    ... consume(iter(f.readline, ''))  # Otherwise iterated identically
    ...
    ...
    15.7 ms ± 98.5 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

The two blocks are identical except that the second one explicitly advances the file one line at the beginning with next(f) to flip `telling` to 0 so future calls to readline don't involve the snapshotting code in textiowrapper_read_chunk.

Calling consume(f) would drop the time to 9.86 ms, but that's saying more about the optimization of the raw iterator protocol over method calls than it is about the `telling` optimization; I used two arg iter in both cases to keep the code paths as similar as possible so the `telling`.

For reference, the input file was 931708 bytes (931467 characters thanks to a few scattered non-ASCII UTF-8 characters), 98569 lines long.

Presumably, the speed difference of 0.1 ms can be chalked up to the telling optimization, so removing it would increase the cost of normal iteration from 9.86 ms to 9.96 ms. That seems de minimis to my mind, in the context of text oriented I/O.

Given that, it seems like triggering this optimization via __next__ should be dropped; it's a microoptimization at best, that's mostly irrelevant compared to the overhead of text-oriented I/O, and it introduces undocumented limitations on the use of TextIOWrapper.

The changes would be to remove all use of the `telling` variable, and (if we want to keep the optimization for unseekable files, where at least no functionality is lost by having it), change the two tests in textiowrapper_read_chunk to test `seekable` in its place. Or we drop the optimization entirely and save 50+ lines of code that provide a fairly tiny benefit in any event.

----------
components: IO, Library (Lib)
messages: 343402
nosy: josh.r
priority: normal
severity: normal
status: open
title: Iterating a text file by line should not implicitly disable tell
versions: Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37036>
_______________________________________


More information about the New-bugs-announce mailing list