On Sat, Apr 25, 2020 at 10:41 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Apr 23, 2020 at 09:10:16PM -0400, Nathan Schneider wrote:

> How, for example, to collate lines from 3 potentially large files while
> ensuring they match in length (without an external dependency)? The best I
> can think of is rather ugly:
>
> with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
>     for lineA, lineB, lineC in zip(a, b, c):
>         do_something_with(lineA, lineB, lineC)
>     assert next(a, None) is None
>     assert next(b, None) is None
>     assert next(c, None) is None
>
> Changing the zip() call to zip(a, b, c, strict=True) would remove the
> necessity of the asserts.

I think that the "correct" (simplest, easiest, most obvious, most
flexible) way is:

    from itertools import zip_longest

    with open('a.txt') as a, open('b.txt') as b, open('c.txt') as c:
        for lineA, lineB, lineC in zip_longest(a, b, c, fillvalue=''):
            do_something_with(lineA, lineB, lineC)

and have `do_something_with` handle the empty string case, either by
raising, or more likely, doing something sensible like treating it as a
blank line rather than dying with an exception.


This is the sentinel pattern with zip_longest() rather than next(). Sure, it works, but I'm not sure it's the most obvious: conceptually, zip_longest() says "I want as many items as the longest of the iterables", yet the loop is then cut short as soon as the fillvalue shows up. It is more natural to say up front "I expect these iterables to have the same length" (if that is what the application demands).
 
Especially if the files differ in how many newlines they end with. E.g.
file a.txt and c.txt end with a newline, but b.txt ends without one, or
ends with an extra blank line at the end.


Well, this depends on the application and the assumptions about where the files come from.
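
To make the newline cases concrete, here is a small sketch that uses io.StringIO objects as stand-ins for the files (the contents are invented for illustration). Note that a missing final newline does not change the number of lines yielded by iteration; only the extra blank line does.

```python
import io

# In-memory stand-ins for the files; the contents are made up.
with_newline = io.StringIO("x\ny\n")     # ends with a newline
without_newline = io.StringIO("x\ny")    # ends without one
extra_blank = io.StringIO("x\ny\n\n")    # ends with an extra blank line

# Iterating a file-like object yields one item per line.
counts = {
    "with_newline": len(list(with_newline)),
    "without_newline": len(list(without_newline)),
    "extra_blank": len(list(extra_blank)),
}
print(counts)
# prints {'with_newline': 2, 'without_newline': 2, 'extra_blank': 3}
```

So whether a length mismatch is an error or just noise really is an application-level decision.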

I can see that zip_longest() will technically work with the sentinel pattern. If there is consensus that zip_longest() should be a builtin, I might start using it instead of zip() with separate checks. But enforcing length-matching still requires an extra check, plus a decision about what the sentinel value should be (for direct file reading '' is fine, but not necessarily for other iterables such as collections or file-loading wrappers). In other words, the pattern carries some conceptual and code overhead as a solution to "make sure the number of items matches".
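
For illustration, here is a minimal sketch of what the sentinel pattern looks like once it is wrapped up for arbitrary iterables. The helper name zip_strict is hypothetical (not an existing function), and the choice of ValueError is my assumption; a private object() is used as the sentinel so that it cannot collide with a legitimate item.

```python
from itertools import zip_longest


def zip_strict(*iterables):
    """Yield tuples like zip(), but raise ValueError on a length mismatch.

    Hypothetical helper, sketched only to show the sentinel pattern.
    """
    missing = object()  # unique sentinel that no real item can be identical to
    for combo in zip_longest(*iterables, fillvalue=missing):
        # A sentinel in the tuple means at least one iterable ran out early.
        if any(item is missing for item in combo):
            raise ValueError("iterables have different lengths")
        yield combo


# Matching lengths behave like plain zip().
pairs = list(zip_strict([1, 2], "ab"))

# Mismatched lengths raise once the shorter iterable is exhausted.
try:
    list(zip_strict([1, 2], [3]))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Every project that needs this ends up carrying some variant of the above, which is the overhead I mean.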

Given that length-matching is a need that many of us frequently encounter, adding strict=True to zip() seems like a very useful and intuitive option to have, without breaking any existing code.

Nathan