Re: [Python-ideas] Iterating non-newline-separated files should be easier

On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i@gmail.com> wrote:
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n. For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it? Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
My original code works as is in this case i.e., *end=nl is still necessary*.
I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it. As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal. It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo.
Yes, I assumed that, I just wanted to make that clear. My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".) At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.)
Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.)
Yes, I've run into such bugs in the past. It's even more fun when you're dealing with interfaces that take an unterminated string plus a separate length. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here.

Andrew Barnert <abarnert@yahoo.com> writes:
I read the [draft]. No translation is a better choice here. Otherwise (at the very least) it breaks the `find -print0` use case.

[draft] http://bugs.python.org/file36008/pep-newline.txt

Simple things should be simple (i.e., no translation unless special case):

- binary file -- a stream of bytes: no structure, no translation on read/write
- text file -- a stream of Unicode codepoints
- file with fixed-length chunks:

    for chunk in iter(partial(file.read, chunksize), EOF):
        pass

- file with variable-length records (aka lines) which end with a separator or EOF: no translation, no escaping (no embedded separators):

    for line in file:
        pass

  or

    line = file.readline()  # next(file)

newline in {None, '', '\r', '\r\n'} is a (very important) special case that represents the complicated legacy behavior for text files.

newline='\0' (like '\n') should be a *much simpler* case: no translation on read/write, no escaping (no embedded '\0'; each '\0' in the stream is a separator).

newline='\0' is simple to explain: readline/next return everything until the next '\0' (including it) or EOF. It is simple to implement: no translation is required.

A readline(keep_end=True) keyword-only parameter and/or a chomp()-like method could be added to simplify removing a trailing newline.

newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} behaves like newline="\n", i.e., no translation.

New *docs for writing text files*:

  When writing output to the stream:

  - if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep
  - if newline is '\r' or '\r\n', any '\n' characters written are translated to the given string
  - no translation takes place for any other newline value.

The docs for binary files are simpler:

  No translation takes place for any newline value. The line terminator is the newline parameter (default is b'\n').
The new *docs for reading text files*:

  When reading input from the stream:

  - if newline is None, universal newlines mode is enabled: lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller
  - if newline is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated
  - if newline is any other value, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

The new behavior, while more powerful, is no more complex than the old one: https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

Backwards compatibility is preserved except that the newline parameter accepts more values.
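The "any other value" reading rule above can be sketched in pure Python. This is only an illustration of the proposed semantics, not the patch from the tracker: `iter_records` and its `chunk_size` parameter are hypothetical names, and the sketch works on an already-decoded text stream rather than inside io.TextIOWrapper.

```python
import io


def iter_records(stream, sep, chunk_size=4096):
    """Yield records from a text stream; each ends with `sep` (kept) or EOF.

    Hypothetical helper: no translation and no escaping are performed.
    """
    if not sep:
        raise ValueError("separator must be non-empty")
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            end = buf.find(sep, start)
            if end == -1:
                break
            end += len(sep)
            yield buf[start:end]  # the separator stays attached
            start = end
        buf = buf[start:]  # keep the incomplete tail for the next chunk
    if buf:
        yield buf  # final record that ended at EOF without a separator


# An embedded '\n' is ordinary data when the separator is '\0'.
records = list(iter_records(io.StringIO("a\nb\0c\0d", newline=""), "\0", chunk_size=2))
```

Keeping the separator on each record mirrors how readline() keeps '\n' today, so a caller would strip it explicitly, e.g. with `record.rstrip("\0")`.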
Keep in mind, I expect that newline='\0' does *not* translate '\n' to '\0'. If you remove newline=nl then embedded \n might be corrupted, i.e., it breaks the `find -print0` use case. Both newline=nl for stdout and end=nl are required here. Though (optionally) it would be nice to change `print()` so that it would use `end=file.newline or '\n'` by default instead.

There is also the line_buffering parameter. From the docs:

  If line_buffering is True, flush() is implied when a call to write contains a newline character.

i.e., you might also need newline=nl to flush() the stream in time. For example, the absence of the flush() call on newline may lead to a deadlock if the subprocess module is used to implement pexpect-like behavior. There are corresponding Python issues:

- text mode http://bugs.python.org/issue21332 : add line_buffering=True if bufsize=1, to avoid a deadlock (regression from Python 2 behavior)
- binary mode http://bugs.python.org/issue21471 : implement line_buffering=True behavior for binary files when bufsize=1
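The flushing concern can be observed with the current io module. The sketch below uses an in-memory BytesIO as a stand-in for a pipe (an assumption made purely for illustration) and shows that, today, a write ending in '\0' does not trigger the implied flush() that a write containing '\n' does.

```python
import io

raw = io.BytesIO()  # stand-in for the pipe a child process would read from
out = io.TextIOWrapper(
    io.BufferedWriter(raw),
    encoding="utf-8",
    newline="\n",          # no translation on output
    line_buffering=True,
)

out.write("abc")
before_newline = raw.getvalue()   # b"": nothing has reached the raw stream yet

out.write("def\n")
after_newline = raw.getvalue()    # b"abcdef\n": the '\n' implied a flush()

out.write("ghi\0")
after_nul = raw.getvalue()        # still b"abcdef\n": '\0' does not flush

out.flush()
after_flush = raw.getvalue()      # b"abcdef\nghi\x00" after an explicit flush
```

A reader waiting for the '\0'-terminated record would block until something else flushes the stream, which is the deadlock scenario described above.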
Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement?
`find -print0` use case that my code implements above.
Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
See the explanation above that starts with "Simple things should be simple."
Both newline=nl and end=nl are needed because I assume that there is no newline translation in the newline='\0' case. See the explanation above. Here's the same code for context:

    sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
    for line in SystemTextStream(sys.stdin.detach(), newline=nl):
        print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
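For comparison, a filter of the same shape can be written against binary streams with the io module as it exists today. This is a rough sketch, not the code from the thread: `filter_nul_separated` is a hypothetical name, the `transform` callable stands in for transform_filename, and the whole input is read into memory for brevity.

```python
import io
import os


def filter_nul_separated(inp, out, transform):
    """Apply `transform` to each '\\0'-terminated filename read from `inp`.

    `inp` and `out` are binary streams; nothing is translated, so an
    embedded b"\\n" inside a filename passes through untouched.
    """
    records = inp.read().split(b"\0")
    if records and records[-1] == b"":
        records.pop()  # the input ended with a terminator
    for record in records:
        name = os.fsdecode(record)  # bytes -> str using the filesystem encoding
        out.write(os.fsencode(transform(name)) + b"\0")


src = io.BytesIO(b"a\nb.txt\0c.txt\0")
dst = io.BytesIO()
filter_nul_separated(src, dst, str.upper)
result = dst.getvalue()
```

In a real pipeline the streams would be `sys.stdin.buffer` and `sys.stdout.buffer`, e.g. fed by `find -print0`.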
Usually different objects are used for input and output, i.e., a single newline parameter allows input newlines to be different from output newlines. The newline behavior for reading and writing is different but it is closely related. Having two parameters wouldn't make the documentation simpler. Separate parameters might be useful if the same file object is used for reading and writing *and* input/output newlines are different from each other. But I don't think it is worth it to complicate the common case (separate objects).
-- Akira

On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i@gmail.com> wrote:
No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate. As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that. (It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.)
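The existing '\r' behavior referred to here can be checked with the current io.TextIOWrapper. The helper name `written_bytes` is made up for this sketch; the newline values and their effects are the documented ones.

```python
import io


def written_bytes(text, newline):
    """Return the bytes produced by writing `text` with the given newline mode."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    out.write(text)
    out.flush()
    data = raw.getvalue()
    out.detach()  # leave `raw` open rather than closing it with the wrapper
    return data


translated = written_bytes("a\nb\n", "\r")    # every '\n' becomes '\r'
untouched = written_bytes("a\nb\n", "")       # '' disables output translation

# On input, newline='\r' splits only on '\r' and leaves embedded '\n' alone.
reader = io.TextIOWrapper(io.BytesIO(b"a\nb\rc\r"), encoding="ascii", newline="\r")
lines = list(reader)
```

So a '\r'-delimited file with embedded '\n' survives a read with newline='\r', but is only written back unchanged if the output wrapper uses newline='' (or '\n').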
Backwards compatibility is preserved except that newline parameter accepts more values.
The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view.
No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly be corrupted.
That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out.
The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken. But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three... I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.
I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me. But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is.

On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
It's potentially still worth spelling out that idea as a Rejected Alternative in the PEP. A draft design that separates them may help clarify the concepts being conflated more effectively than simply describing them, even if your own pragmatic assessment is "too much pain for not enough gain". Cheers, Nick.

Nick Coghlan <ncoghlan@gmail.com> writes:
It can't be in the rejected ideas because it is the current behavior for io.TextIOWrapper(newline=..) and it will never change (in Python 3) due to backward compatibility. As I understand it, Andrew doesn't like that the *newline* parameter does too much:

- the *newline* parameter turns universal newlines mode on/off
- it may specify the line separator, e.g., newline='\r'
- it specifies whether newline translation happens, e.g., newline='' turns it off
- together with *line_buffering*, it may enable flush() if a newline is written

It is unrelated to my proposal [1], which shouldn't change the old behavior if newline in {None, '', '\n', '\r', '\r\n'}.

[1] http://bugs.python.org/issue1152248#msg224016
-- Akira
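The first three roles can be seen on the reading side with the current io.TextIOWrapper. The helper `read_lines` is a hypothetical name used only for this sketch.

```python
import io


def read_lines(data, newline):
    """Split `data` into lines using the given newline mode."""
    return list(io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline))


data = b"one\r\ntwo\rthree\n"

# None: universal newlines on, every line ending translated to '\n'.
translated = read_lines(data, None)

# '': universal newlines on, line endings returned untranslated.
untranslated = read_lines(data, "")

# '\n': universal newlines off, only '\n' terminates a line.
only_lf = read_lines(data, "\n")
```

The three calls differ only in the one argument, which is the conflation being discussed: the same parameter selects the mode, the separator, and whether translation happens.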

On Jul 25, 2014, at 19:24, Akira Li <4kir4.1i@gmail.com> wrote:
That's exactly why changing it would be a "rejected idea". It certainly doesn't hurt to document the fact that we thought about it and decided not to change it for backward compatibility reasons.
Exactly. And the fourth one only indirectly; "newline" flushing doesn't exactly mean _either_ of "\n" or the newline argument. And the related-but-definitely-not-the-same newlines attribute makes it even more confusing. (I've found bug reports with both Guido and Nick confused into thinking that newline was available as an attribute after construction; what hope do the rest of us have?) But the reality is, it rarely affects real-life programs, so it's definitely not worth breaking compatibility over. And it's still a whole lot cleaner than the 2.x design despite having a lot more details to deal with.
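The difference between the newline argument and the newlines attribute can be shown briefly with the current io module. The attribute reports which line endings have been seen while reading in universal newlines mode; it does not echo the constructor argument.

```python
import io

stream = io.TextIOWrapper(io.BytesIO(b"a\r\nb\r\n"), encoding="ascii", newline=None)

before = stream.newlines   # None: nothing has been read yet
text = stream.read()       # line endings are translated to '\n'
seen = stream.newlines     # the one kind of line ending encountered

mixed = io.TextIOWrapper(io.BytesIO(b"a\r\nb\nc\r"), encoding="ascii", newline=None)
mixed.read()
seen_mixed = set(mixed.newlines)  # a tuple when several kinds were seen
```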

I've added a patch that demonstrates "no translation" for alternative newlines behavior http://bugs.python.org/issue1152248#msg224016
Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> writes:
I won't repeat it several times, but as you've already found out, newline='\0' for stdout (at the very least) can be useful for line_buffering=True behavior. ...
Yes. I've found at least one issue http://bugs.python.org/issue22069
-- Akira

On Jul 25, 2014, at 19:13, Akira Li <4kir4.1i@gmail.com> wrote:
I've added a patch that demonstrates "no translation" for alternative newlines behavior http://bugs.python.org/issue1152248#msg224016
Having taken a better look at the line buffering code, I now agree with you that this is necessary; otherwise we'd have to make a much bigger change to the implementation (which I don't think we want). When I update the draft PEP I'll change that and add a rationale (this also makes the rationale for "no translation for binary files" and for "only readnl is exposed, not writenl" a lot simpler). I'll also change it in my C patch (which I hope to be able to clean up and upload this weekend).

participants (3): Akira Li, Andrew Barnert, Nick Coghlan