[Python-ideas] Iterating non-newline-separated files should be easier

Sun Jul 20 02:57:14 CEST 2014

On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

>On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert at yahoo.com> wrote:

>> In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

>I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.

>Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.

But newline is only permitted for text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and \0, while on text files the list of allowed values stays the same as today?
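
For reference, a minimal sketch of the current behavior as I understand it: open() rejects any newline argument in binary mode, and rejects '\0' as a newline value in text mode. The helper try_open and the scratch file name are made up for the illustration.

```python
import os
import tempfile


def try_open(path, mode, newline):
    """Return the name of the exception open() raises, or None on success."""
    try:
        with open(path, mode, newline=newline):
            return None
    except (ValueError, TypeError) as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical scratch file holding two NUL-terminated names.
    path = os.path.join(tmp, "names.bin")
    with open(path, "wb") as f:
        f.write(b"a\0b\0")

    binary_result = try_open(path, "rb", "\n")   # newline given in binary mode
    nul_result = try_open(path, "r", "\0")       # NUL as a text-mode newline
    default_result = try_open(path, "r", None)   # the default, always accepted

print(binary_result, nul_result, default_result)
```

Both rejected cases raise ValueError today, so either proposal means relaxing one of these two checks.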

Also, would you want the same semantics for newline='\0' on binary files that newline='\r' has on text files (including newline remapping on write)?
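
To make "the same semantics" concrete, here is a small sketch of what newline='\r' does on a text stream today, using an in-memory buffer instead of a real file. On write, every '\n' is translated to '\r'; on read, lines end only at '\r' and the terminator is returned untranslated.

```python
import io

# Writing: '\n' in the text is remapped to '\r' in the underlying bytes.
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="ascii", newline="\r")
writer.write("one\ntwo\n")
writer.flush()
written = buf.getvalue()
print(written)  # b'one\rtwo\r'

# Reading: only '\r' ends a line, and it is passed through unchanged.
reader = io.TextIOWrapper(io.BytesIO(written), encoding="ascii", newline="\r")
lines = list(reader)
print(lines)  # ['one\r', 'two\r']
```

A binary-mode newline=b'\0' with the same rules would presumably turn every b'\n' written into b'\0', which is the part of the question that matters.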

And I'm still not sure why you think this shouldn't be allowed in text mode in the first place (especially given that you suggested the same thing for text files _only_ a few years ago).

The output of find is a list of newline-separated or \0-separated filenames, in the filesystem's encoding. Why should I be able to handle the first as a text file, but have to handle the second as a binary file and then manually decode each line?
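
A rough sketch of the asymmetry, with in-memory bytes standing in for the tool's output (the file names are invented, and sys.getfilesystemencoding() is assumed to be the right codec):

```python
import io
import sys

encoding = sys.getfilesystemencoding()
names = ["a.txt", "dir/b c.txt"]  # made-up names for the illustration

newline_data = "".join(n + "\n" for n in names).encode(encoding)
nul_data = "".join(n + "\0" for n in names).encode(encoding)

# Newline-separated: the text layer decodes and splits lines for us.
with io.TextIOWrapper(io.BytesIO(newline_data), encoding=encoding) as f:
    from_text = [line.rstrip("\n") for line in f]

# NUL-separated: read raw bytes, split by hand, then decode each piece.
raw = io.BytesIO(nul_data).read()
from_binary = [chunk.decode(encoding) for chunk in raw.split(b"\0") if chunk]

print(from_text == from_binary == names)
```

The second version also has to read the whole stream before splitting, where the first iterates lazily.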

You could argue that find -print0 isn't really separating Unicode filenames with U+0000, but separating UTF-8 or Latin-1 or whatever filenames with \x00, and it's just a coincidence that they happen to match up. But it really isn't just a coincidence; it was an intentional design decision for Unicode (and UTF-8, and Latin-1) that the ASCII control characters map in the obvious way, and one that many tools and scripts take advantage of, so why shouldn't tools and scripts written in Python be able to take advantage of it?
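
A quick check of that mapping, as a sketch: the byte \x00 decodes to U+0000 in ASCII, UTF-8, and Latin-1, and because UTF-8 never uses a zero byte inside a multi-byte sequence, splitting before or after decoding gives the same result. The sample strings are arbitrary.

```python
# The NUL byte maps to U+0000 in each of these codecs.
nul_maps_cleanly = all(
    b"\x00".decode(codec) == "\u0000"
    for codec in ("ascii", "utf-8", "latin-1")
)

# Arbitrary non-ASCII sample data, NUL-terminated like find -print0 output.
raw = "caf\u00e9\0na\u00efve\0".encode("utf-8")

split_then_decode = [part.decode("utf-8") for part in raw.split(b"\0")]
decode_then_split = raw.decode("utf-8").split("\0")

print(nul_maps_cleanly, split_then_decode == decode_then_split)
```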