[Python-ideas] Iterating non-newline-separated files should be easier

Wed Jul 23 06:24:12 CEST 2014

On Jul 21, 2014, at 0:04, Paul Moore <p.f.moore at gmail.com> wrote:

> On 21 July 2014 01:41, Andrew Barnert <abarnert at yahoo.com.dmarc.invalid> wrote:
>> OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
> 
> As a suggestion, how about adding an example of a simple nul-separated
> filename filter - the sort of thing that could go in a find -print0 |
> xxx | xargs -0 pipeline? If I understand it, that's one of the key
> motivating examples for this change, so seeing how it's done would be
> a great help.
> 
> Here's the sort of thing I mean, written for newline-separated files:
> 
> import sys
> 
> def process(filename):
>    """Trivial example"""
>    return filename.lower()
> 
> if __name__ == '__main__':
> 
>    for filename in sys.stdin:
>        filename = process(filename)
>        print(filename)

# Needs "import io" at the top; newline='\0' is what the draft PEP proposes.
for filename in io.TextIOWrapper(sys.stdin.buffer, encoding=sys.stdin.encoding, errors=sys.stdin.errors, newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

I assume you wanted an rstrip('\n') in the original, so I did the equivalent here.

If you want to pipe the result to another -0 tool, you also need to add end='\0' to the print, of course.
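
Here is a rough sketch of just the output side. The helper name emit and the StringIO target are made up for illustration; in a real filter you would leave out at its default so it writes to sys.stdout:

```python
import io
import sys

def emit(names, out=None):
    """Write each name followed by a NUL instead of a newline."""
    if out is None:
        out = sys.stdout
    for name in names:
        # end='\0' replaces the default '\n' terminator
        print(name, end='\0', file=out)

# Demo: capture the output in memory so the terminators are visible.
buf = io.StringIO()
emit(['a.txt', 'odd\nname.txt'], out=buf)
result = buf.getvalue()
```

The second name contains an embedded newline, which is exactly the case a downstream xargs -0 can still split correctly.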

If we had Nick Coghlan's separate idea of adding rewrap methods to the stream classes (not part of this proposal, but I would be happy to have it), it would be even simpler:

for filename in sys.stdin.rewrap(newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

Anyway, this isn't perfect if, e.g., you might have Latin-1 filenames that aren't valid UTF-8 hiding in your UTF-8 filesystem, but neither is your code. In fact, this does exactly the same thing as yours, except that it takes \0 terminators, so it can handle filenames with embedded newlines, or pipelines that use -print0 just because they can't be sure which tools in the chain can handle spaces.

It's obviously a little more complicated than your code, but that's to be expected; it's a lot simpler than anything we can write today. (And it runs at the same speed as your code instead of 2x slower or worse.)
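
For comparison, this is roughly the kind of thing you have to write today to get the same iteration without any newline='\0' support. It is only a sketch: iter_records and chunk_size are names I made up, and it reads and splits in pure Python rather than letting the io layer do the work:

```python
import io

def iter_records(stream, sep='\0', chunk_size=8192):
    """Yield sep-terminated records from a text stream, separator stripped."""
    buf = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; keep the tail.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        # Final record with no trailing separator
        yield buf

# Demo on an in-memory stream; in the filter this would be sys.stdin.
demo = io.StringIO('a.txt\0dir/b\nc.txt\0last')
names = list(iter_records(demo))
```

Unlike the loops above, this yields the names already stripped of the terminator, so there is no rstrip('\0') step.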

> This is also an example of why I'm struggling to understand how an
> open() parameter "solves all the cases". There's no explicit open()
> call here, so how do you specify the record separator? Seeing how you
> propose this would work would be really helpful to me.

The open function is just a shortcut to constructing a stack of io classes; you can always construct them manually. It would be nice if some cases of that were made a little easier (again, see Nick's proposal above), but it's easy enough to live with.
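
To make that concrete, here is a small sketch of building the stack by hand, using only what exists today. It is roughly what open(path, encoding='utf-8', newline='\n') gives you; the temp file is just there so the example runs on its own:

```python
import io
import os
import tempfile

# Create a small file to read back (illustration only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'first\nsecond\n')
    path = tmp.name

raw = io.FileIO(path, 'r')            # unbuffered bytes from the OS
buffered = io.BufferedReader(raw)     # adds a read buffer
text = io.TextIOWrapper(buffered, encoding='utf-8', newline='\n')  # decoding and line splitting

with text:                            # closing the wrapper closes the layers below it
    lines = list(text)

os.remove(path)
```

The record separator is a parameter of the outermost layer, so for sys.stdin, where there is no open() call, you wrap sys.stdin.buffer in a new TextIOWrapper instead.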