Re: [Python-ideas] Iterating non-newline-separated files should be easier
On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i@gmail.com> wrote:
Andrew Barnert <abarnert@yahoo.com> writes:
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i@gmail.com> wrote:
Paul Moore <p.f.moore@gmail.com> writes:
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.
Here's the sort of thing I mean, written for newline-separated files:
    import sys

    def process(filename):
        """Trivial example"""
        return filename.lower()

    if __name__ == '__main__':
        for filename in sys.stdin:
            filename = process(filename)
            print(filename)
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
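For comparison, here is a minimal sketch of how such a filter can be written today, without any new open() or TextIOWrapper parameter, by working on the binary streams and splitting on NUL by hand. The function name `filter_nul_separated` and the choice of UTF-8 with surrogateescape are assumptions made for illustration; they are not part of the draft PEP.

```python
def process(filename: str) -> str:
    """Trivial example: lower-case the name."""
    return filename.lower()


def filter_nul_separated(data: bytes, encoding: str = "utf-8") -> bytes:
    """Apply process() to every NUL-terminated name in data.

    Hypothetical helper: decodes with surrogateescape so that bytes which
    are not valid in `encoding` survive the round trip unchanged.
    """
    out = []
    for raw in data.split(b"\0"):
        if not raw:  # the empty piece after the final NUL (names are never empty)
            continue
        name = raw.decode(encoding, errors="surrogateescape")
        out.append(process(name).encode(encoding, errors="surrogateescape"))
    # Re-terminate every record with NUL; embedded '\n' bytes are untouched.
    return b"".join(item + b"\0" for item in out)
```

In a pipeline the function would be fed `sys.stdin.buffer.read()` and its result written to `sys.stdout.buffer`. It reads the whole input into memory and needs manual decoding, which is the inconvenience the proposal is meant to remove.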
`find -print0 | ./tr-filename -0 | xargs -0` example implies that you can replace `sys.std*` streams without worrying about preserving `sys.__std*__` streams:
    #!/usr/bin/env python
    import io
    import re
    import sys
    from pathlib import Path

    def transform_filename(filename: str) -> str:  # example
        """Normalize whitespace in basename."""
        path = Path(filename)
        new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
        path.replace(new_path)  # rename on disk if necessary
        return str(new_path)

    def SystemTextStream(bytes_stream, **kwargs):
        encoding = sys.getfilesystemencoding()
        return io.TextIOWrapper(bytes_stream,
            encoding=encoding,
            errors='surrogateescape' if encoding != 'mbcs' else 'strict',
            **kwargs)

    nl = '\0' if '-0' in sys.argv else None
    sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
    for line in SystemTextStream(sys.stdin.detach(), newline=nl):
        print(transform_filename(line.rstrip(nl)), end=nl)
Nice, much more complete example than mine. I just tried to handle as many edge cases as the original he asked about, but you handle everything.
io.TextIOWrapper() plays the role of open() in this case. The code assumes that `newline` parameter accepts '\0'.
The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().
Note:
- `end=nl` is necessary because `print()` prints '\n' by default -- it does not use `file.newline`
Actually, yes it does. Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or '').
You are right. I've stopped reading the source for print() function at `PyFile_WriteString("\n", file);` line assuming that "\n" is not translated if newline="\0". But the current behaviour if "\0" were in "the other legal values" category (like "\r") would be to translate "\n" [1]:
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
[1] https://docs.python.org/3/library/io.html#io.TextIOWrapper
Example:
    $ ./python -c 'import sys, io; sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); sys.stdout.write("\n\r\r\n")' | xxd
    0000000: 0d0a 0d0d 0d0a                           ......
"\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r").
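The same output translation can be checked in-process, without replacing sys.stdout. This is a small sketch; the helper name `written_bytes` is made up for illustration, and it only uses newline values that TextIOWrapper accepts today.

```python
import io


def written_bytes(text: str, newline) -> bytes:
    """Write text through a TextIOWrapper and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    stream.write(text)
    stream.flush()  # push pending text down to the BytesIO
    return raw.getvalue()


# newline="\r\n": every "\n" written becomes b"\r\n"; a lone "\r" is kept.
crlf = written_bytes("\n\r\r\n", "\r\n")
# newline="" and newline="\n": no translation at all.
untranslated = written_bytes("\n\r\r\n", "")
```

Here `crlf` holds the same six bytes as the xxd dump above, while `untranslated` holds the four characters exactly as written.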
For the newline="\0" case to work, it should instead behave like the newline='' or newline='\n' case, i.e., no translation should take place, to avoid corrupting embedded "\n\r" characters.
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n. For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it? Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement? Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
My original code works as is in this case i.e., *end=nl is still necessary*.
But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)
But it uses sys.stdout.newline, not sys.stdin.newline.
The code affects *both* sys.stdout/sys.stdin. Look [2]:
I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it. As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal. It also might be nice to have a full set of PYTHONIOFOO env variables rather than just PYTHONIOENCODING, but again, I don't think that needs to be part of this proposal. And likewise for Nick Coghlan's rewrap method proposal on TextIOWrapper and maybe BufferedFoo.
    sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
    for line in SystemTextStream(sys.stdin.detach(), newline=nl):
        print(transform_filename(line.rstrip(nl)), end=nl)
[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
- SystemTextStream() handles filenames that are undecodable in the current locale, i.e., non-ascii names are allowed even in C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear how to pass an undecodable filename via a pipe on Windows -- perhaps `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It assumes that the short path exists and it is always encodable using mbcs. If we can control all parts of the pipeline *and* Windows API uses proper utf-16 (not ucs-2) then utf-8 can be used to pass filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could be tried, e.g., https://github.com/Drekin/win-unicode-console
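The POSIX half of that note rests on the surrogateescape error handler, which SystemTextStream() passes to TextIOWrapper. A minimal sketch of the round trip follows; the helper name `roundtrip` and the sample byte strings are illustrative only.

```python
def roundtrip(raw_name: bytes, encoding: str = "utf-8") -> bytes:
    """Decode a filename as text and encode it back, preserving bad bytes."""
    text = raw_name.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape")


# b"\xe9" is Latin-1 'e acute', which is not valid UTF-8 on its own.
latin1_name = b"caf\xe9.txt"
as_text = latin1_name.decode("utf-8", errors="surrogateescape")

# Even with the ASCII codec (as in LC_CTYPE=C) non-ascii bytes survive.
utf8_name = "caf\u00e9.txt".encode("utf-8")
survives_ascii = roundtrip(utf8_name, "ascii") == utf8_name
```

The undecodable byte becomes a lone surrogate in `as_text`, so the name can still be treated as str (for example, matched with a regular expression) and written back without loss.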
First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)? Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them?
In short: I don't know :)
To be clear, I'm talking about native Windows applications (not find/xargs on Cygwin). The goal is to process robustly *arbitrary* filenames on Windows via a pipe (SystemTextStream()) or network (bytes interface).
Yes, I assumed that, I just wanted to make that clear. My point is that if there isn't already an ecosystem of tools that do so on Windows, or a recommended answer from Microsoft, we don't need to fit into existing practices here. (Actually, there _is_ a recommended answer from Microsoft, but it's "don't send encoded filenames over a binary stream, send them as an array of UTF-16 strings over PowerShell cmdlet typed pipes"--and, more generally, "don't use any ANSI interfaces except for backward compatibility reasons".) At any rate, if the filenames-over-pipes encoding problem exists on Windows, and if it's solvable, it's still outside the scope of this proposal, unless you think the documentation needs a completely worked example that shows how to interact with some Windows tool, alongside one for interacting with find -print0 on Unix. (And I don't think it does. If we want a Windows example, resource compiler string input files, which are \0-terminated UTF-16, probably serve better.)
I know that the (A)nsi API (and therefore the "POSIX-ish layer" that uses narrow strings, such as main(), fopen(), fstream) is broken, e.g., Thai filenames on a Greek computer [3].
Yes, and broken in a way that people cannot easily work around except by using the UTF-16 interfaces. That's been Microsoft's recommended answer to the problem since NT 3.5, Win 95, and MSVCRT 3: if you want to handle all filenames, use _wmain, _wfopen, etc.--or, better, use CreateFileW instead of fopen. They never really addressed the issue of passing filenames between command-line tools at all, until PowerShell, where you pass them as a list of UTF-16 strings rather than a stream of newline-separated encoded bytes. (As a side note, I have no idea how well Python works for writing PowerShell cmdlets, but I don't think that's relevant to the current proposal.)
Unicode (W) API should enforce utf-16 in principle since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many places due to bad programming practices (based on the common wrong assumption that Unicode == UTF-16 == UCS-2) and/or bugs that are not fixed due to MS' backwards compatibility policies in the past [5].
Yes, I've run into such bugs in the past. It's even more fun when you're dealing with unterminated string with separate length interfaces. Fortunately, as far as I know, no such bugs affect reading and writing binary files, pipes, and sockets, so they don't affect us here.
Andrew Barnert <abarnert@yahoo.com> writes:
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.
I read the [draft]. No translation is a better choice here. Otherwise (at the very least) it breaks `find -print0` use case.

[draft] http://bugs.python.org/file36008/pep-newline.txt

Simple things should be simple (i.e., no translation unless special case):

- binary file -- a stream of bytes: no structure, no translation on read/write
- text file -- a stream of Unicode codepoints
- file with fixed-length chunks:

    for chunk in iter(partial(file.read, chunksize), EOF):
        pass

- file with variable-length records (aka lines) which end with a separator or EOF: no translation, no escaping (no embedded separators):

    for line in file:
        pass

  or

    line = file.readline()  # next(file)

newline in {None, '', '\r', '\r\n'} is a (very important) special case that represents the complicated legacy behavior for text files.

newline='\0' (like '\n') should be a *much simpler* case: no translation on read/write, no escaping (no embedded '\0', each '\0' in the stream is a separator).

newline='\0' is simple to explain: readline/next return everything until the next '\0' (including it) or EOF. It is simple to implement - no translation is required.

readline(keep_end=True) keyword-only parameter and/or chomp()-like method could be added to simplify removing a trailing newline.

newline in {"\N{NEL}", "\n\n", "\r\r", "\n\r"} behave like newline="\n" i.e., no translation.

New *docs for writing text files*:

  When writing output to the stream:

  - if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep
  - if newline is '\r' or '\r\n', any '\n' characters written are translated to the given string
  - no translation takes place for any other newline value.

The docs for binary files are simpler:

  No translation takes place for any newline value. The line terminator is newline parameter (default is b'\n').
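The record semantics described above (each read returns everything up to and including the next separator, with no translation) can be emulated today on top of the fixed-length chunk idiom. This is only a sketch of the proposed behaviour, not the patch itself; `iter_records` and its parameters are hypothetical names.

```python
import io
from functools import partial


def iter_records(stream, sep="\0", chunksize=4096):
    """Yield records from a text stream, each including its trailing sep.

    No translation or escaping is performed; a final record that ends at
    EOF without a separator is yielded as is.
    """
    pending = ""
    # iter(callable, sentinel) stops when read() returns '' at EOF.
    for chunk in iter(partial(stream.read, chunksize), ""):
        pending += chunk
        # Everything before the last separator is complete; the tail may
        # be a partial record (or a partial multi-character separator).
        *complete, pending = pending.split(sep)
        for record in complete:
            yield record + sep
    if pending:
        yield pending


# newline="" keeps StringIO from touching "\n" or "\r" in the data.
sample = io.StringIO("a\nb\0c\0d", newline="")
records = list(iter_records(sample, chunksize=2))
```

With the tiny chunk size the separator handling is exercised across chunk boundaries; the embedded "\n" stays inside the first record.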
The new *docs for reading text files*:

  When reading input from the stream:

  - if newline is None, universal newlines mode is enabled: lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller
  - if newline is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated
  - if newline is any other value, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

The new behavior, being more powerful, is no more complex than the old one https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

Backwards compatibility is preserved except that newline parameter accepts more values.
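For reference, the reading rules for the values that are legal today can be observed directly. The sketch below uses made-up helper names (`read_lines`, `accepts_newline`) and a made-up sample; it also shows that other values, including '\0', are currently rejected at construction time.

```python
import io

DATA = b"one\ntwo\r\nthree\rfour"


def read_lines(newline):
    """Return the lines produced for DATA with the given newline argument."""
    stream = io.TextIOWrapper(io.BytesIO(DATA), encoding="ascii",
                              newline=newline)
    return list(stream)


def accepts_newline(value):
    """Report whether TextIOWrapper currently accepts this newline value."""
    try:
        io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline=value)
    except (TypeError, ValueError):
        return False
    return True


universal = read_lines(None)    # endings translated to '\n'
untranslated = read_lines("")   # universal splitting, endings kept
cr_only = read_lines("\r")      # only '\r' terminates a line
```

Comparing `universal`, `untranslated` and `cr_only` shows the three bullets above on a single input.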
For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it?
Keep in mind, I expect that newline='\0' does *not* translate '\n' to '\0'. If you remove newline=nl then embedded \n might be corrupted i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and end=nl are required here. Though (optionally) it would be nice to change `print()` so that it would use `end=file.newline or '\n'` by default instead.

There is also line_buffering parameter. From the docs:

  If line_buffering is True, flush() is implied when a call to write contains a newline character.

i.e., you might also need newline=nl to flush() the stream in time. For example, the absence of the flush() call on newline may lead to a deadlock if subprocess module is used to implement pexpect-like behavior. There are corresponding Python issues:

- text mode http://bugs.python.org/issue21332 : add line_buffering=True if bufsize=1, to avoid a deadlock (regression from Python 2 behavior)
- binary mode http://bugs.python.org/issue21471 : implement line_buffering=True behavior for binary files when bufsize=1
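The line_buffering point can be made concrete with the current implementation: a write that ends in '\0' is not treated as a line end, so nothing reaches the underlying stream until something else triggers a flush. The sketch below is an illustration under that assumption; `flushed_after` is a hypothetical helper, and the BufferedWriter layer mimics how sys.stdout is normally stacked.

```python
import io


def flushed_after(writes, newline=""):
    """Return what has reached the raw stream after each write() call."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(io.BufferedWriter(raw), encoding="ascii",
                           newline=newline, line_buffering=True)
    seen = []
    for text in writes:
        out.write(text)
        seen.append(raw.getvalue())  # snapshot; no explicit flush()
    return seen


nul_terminated = flushed_after(["a\0", "b\0"])   # nothing is flushed
newline_terminated = flushed_after(["a\0", "b\n"])  # '\n' forces a flush
```

A consumer waiting on the other end of a pipe for the first NUL-terminated record would therefore keep waiting, which is the deadlock scenario mentioned above.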
Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement?
`find -print0` use case that my code implements above.
Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
See the explanation above that starts with "Simple things should be simple."
My original code works as is in this case i.e., *end=nl is still necessary*.
But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)
But it uses sys.stdout.newline, not sys.stdin.newline.
The code affects *both* sys.stdout/sys.stdin. Look [2]:
I didn't notice that you passed it for stdout as well--as I explained above, you don't need it, and shouldn't do it.
Both newline=nl and end=nl are needed because I assume that there is no newline translation in newline='\0' case. See the explanation above. Here's the same code for context:

    sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
    for line in SystemTextStream(sys.stdin.detach(), newline=nl):
        print(transform_filename(line.rstrip(nl)), end=nl)

[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
As a side note, I think it might have been a better design to have separate arguments for input newline, output newline, and universal newlines mode, instead of cramming them all into one argument; for some simple cases the current design makes things a little less verbose, but it gets in the way for more complex cases, even today with \r or \r\n. However, I don't think that needs to be changed as part of this proposal.
Usually different objects are used for input and output i.e., a single newline parameter allows input newlines to be different from output newlines.

The newline behavior for reading and writing is different but it is closely related. Having two parameters wouldn't make the documentation simpler.

Separate parameters might be useful if the same file object is used for reading and writing *and* input/output newlines are different from each other. But I don't think it is worth it to complicate the common case (separate objects).

-- Akira
On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i@gmail.com> wrote:
Andrew Barnert <abarnert@yahoo.com> writes:
On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i@gmail.com> wrote:
For the newline="\0" case to work, it should instead behave like the newline='' or newline='\n' case, i.e., no translation should take place, to avoid corrupting embedded "\n\r" characters.
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.
I read the [draft]. No translation is a better choice here. Otherwise (at the very least) it breaks `find -print0` use case.
No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate. As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that. (It's less of an issue for binary files, because binary files can't take a newline parameter at all today, and because "no output translation" has been part of the definition of what "binary file" means all the way back to Python 1.x.)
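The classic-Mac parallel can be tried with today's API, since '\r' is already a legal newline value. This is a sketch with invented names (`copy_records`) and sample data: it reads '\r'-delimited records and writes them back through an output wrapper whose newline argument is varied.

```python
import io


def copy_records(data: bytes, out_newline) -> bytes:
    """Copy '\\r'-delimited records through a pair of TextIOWrappers."""
    src = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline="\r")
    raw_out = io.BytesIO()
    dst = io.TextIOWrapper(raw_out, encoding="ascii", newline=out_newline)
    for record in src:  # each record still ends with its '\r'
        dst.write(record)
    dst.flush()
    return raw_out.getvalue()


data = b"first\nhalf\rsecond\r"
passthrough = copy_records(data, "")   # output untouched
translated = copy_records(data, "\r")  # embedded '\n' becomes '\r'
```

Passing '' on the output side preserves the embedded '\n', while passing '\r' turns it into an extra record boundary.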
Backwards compatibility is preserved except that newline parameter accepts more values.
The same is true with the draft proposal. You've basically copied the exact same thing, except for what happens on output for newlines other than None, '', '\n', '\r', and '\r\n' in text files. Since that case cannot arise today, there are no backward compatibility issues. Your version is only a small change to the documentation and a small change to the code, but my version is an even smaller change to the documentation and no change to the code, so you can't argue this from a conservative point of view.
For your script, there is no reason to pass newline=nl to the stdout replacement. The only effect that has on output is \n replacement, which you don't want. And if we removed that effect from the proposal, it would have no effect at all on output, so why pass it?
Keep in mind, I expect that newline='\0' does *not* translate '\n' to '\0'. If you remove newline=nl then embedded \n might be corrupted
No, it's only corrupted if you _pass_ newline=nl. If you instead passed, e.g., newline='', nothing could possibly be corrupted.
i.e., it breaks `find -print0` use-case. Both newline=nl for stdout and end=nl are required here. Though (optionally) it would be nice to change `print()` so that it would use `end=file.newline or '\n'` by default instead.
That might be a nice change; I'll mention it in the next draft. But I think it's better to keep the changes as small and conservative as possible, so unless there's an upswell of support for it, I think anything that isn't actually necessary to solving the problem should be left out.
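As a rough illustration of the suggested `end=file.newline or '\n'` default, the behaviour can be approximated with a wrapper function. Everything here is hypothetical: current file objects do not expose a `newline` attribute, so the sketch uses a StringIO subclass (`NulStream`) that carries one, and `print_record` is an invented name.

```python
import io
import sys


def print_record(*args, file=None, sep=" ", end=None):
    """print() variant whose default end comes from the target stream."""
    if file is None:
        file = sys.stdout
    if end is None:
        # Hypothetical attribute: fall back to '\n' when it is missing.
        end = getattr(file, "newline", None) or "\n"
    print(*args, sep=sep, end=end, file=file)


class NulStream(io.StringIO):
    """Stand-in for a stream that was opened with newline='\\0'."""
    newline = "\0"  # assumed attribute, not part of the io API


nul_out = NulStream()
print_record("some name", file=nul_out)

plain_out = io.StringIO()
print_record("some name", file=plain_out)
```

With such a default, the explicit `end=nl` in the script above would no longer be needed for streams that know their own separator.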
There is also line_buffering parameter. From the docs:
If line_buffering is True, flush() is implied when a call to write contains a newline character.
The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken. But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three... I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.
Do you have a use case where you need to pass a non-standard newline to a text file/stream, but don't want newline replacement?
`find -print0` use case that my code implements above.
Or is it just a matter of avoiding confusion if people accidentally pass it for stdout when they didn't want it?
See the explanation above that starts with "Simple things should be simple."
I still don't understand your point here, and just repeating it isn't helping. You're making simple things _less_ simple than they are in the draft, requiring slightly more change to the documentation and to the code and slightly more for people to understand just to allow them to pass an unnecessary parameter. That doesn't sound like an argument from simplicity to me. But line_buffering definitely might be a good argument, in which case it doesn't matter how good this one is.
On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.
It's potentially still worth spelling out that idea as a Rejected Alternative in the PEP. A draft design that separates them may help clarify the concepts being conflated more effectively than simply describing them, even if your own pragmatic assessment is "too much pain for not enough gain". Cheers, Nick.
Nick Coghlan <ncoghlan@gmail.com> writes:
On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.
It's potentially still worth spelling out that idea as a Rejected Alternative in the PEP. A draft design that separates them may help clarify the concepts being conflated more effectively than simply describing them, even if your own pragmatic assessment is "too much pain for not enough gain".
It can't be in the rejected ideas because it is the current behavior for io.TextIOWrapper(newline=..) and it will never change (in Python 3) due to backward compatibility.

As I understand Andrew doesn't like that *newline* parameter does too much:

- *newline* parameter turns on/off universal newline mode
- it may specify the line separator e.g., newline='\r'
- it specifies whether newline translation happens e.g., newline='' turns it off
- together with *line_buffering*, it may enable flush() if newline is written

It is unrelated to my proposal [1] that shouldn't change the old behavior if newline in {None, '', '\n', '\r', '\r\n'}.

[1] http://bugs.python.org/issue1152248#msg224016

-- Akira
On Jul 25, 2014, at 19:24, Akira Li <4kir4.1i@gmail.com> wrote:
Nick Coghlan <ncoghlan@gmail.com> writes:
On 26 Jul 2014 04:33, "Andrew Barnert" <abarnert@yahoo.com.dmarc.invalid> wrote:
As I've said before, I don't really like the design for '\r' and '\r\n', or the fact that three separate notions (universal-newlines flag, line ending for readline, and output translation for write) are all conflated into one idea and crammed into one parameter, but I think it's probably too late and too radical to change that.
It's potentially still worth spelling out that idea as a Rejected Alternative in the PEP. A draft design that separates them may help clarify the concepts being conflated more effectively than simply describing them, even if your own pragmatic assessment is "too much pain for not enough gain".
It can't be in the rejected ideas because it is the current behavior for io.TextIOWrapper(newline=..) and it will never change (in Python 3) due to backward compatibility.
That's exactly why changing it would be a "rejected idea". It certainly doesn't hurt to document the fact that we thought about it and decided not to change it for backward compatibility reasons.
As I understand Andrew doesn't like that *newline* parameter does too much:
- *newline* parameter turns on/off universal newline mode
- it may specify the line separator e.g., newline='\r'
- it specifies whether newline translation happens e.g., newline='' turns it off
- together with *line_buffering*, it may enable flush() if newline is written
Exactly. And the fourth one only indirectly; "newline" flushing doesn't exactly mean _either_ of "\n" or the newline argument. And the related-but-definitely-not-the-same newlines attribute makes it even more confusing. (I've found bug reports with both Guido and Nick confused into thinking that newline was available as an attribute after construction; what hope do the rest of us have?) But the reality is, it rarely affects real-life programs, so it's definitely not worth breaking compatibility over. And it's still a whole lot cleaner than the 2.x design despite having a lot more details to deal with.
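The difference between the newline argument and the newlines attribute is easy to see interactively. A short sketch, with an arbitrary sample input:

```python
import io

stream = io.TextIOWrapper(io.BytesIO(b"a\nb\r\nc\rd"), encoding="ascii",
                          newline=None)  # universal newlines mode

before = stream.newlines   # nothing has been read yet
text = stream.read()       # all endings are translated to '\n'
after = stream.newlines    # the kinds of line ending seen so far

# The constructor argument itself is not exposed as an attribute.
has_newline_attribute = hasattr(stream, "newline")
```

`newlines` reports what was encountered in the input, not what was requested at construction time, and there is no public `newline` attribute to read the latter back.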
I've added a patch that demonstrates "no translation" for alternative newlines behavior http://bugs.python.org/issue1152248#msg224016

Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> writes:
On Thursday, July 24, 2014 2:08 AM, Akira Li <4kir4.1i@gmail.com> wrote:
Andrew Barnert <abarnert@yahoo.com> writes:
On Jul 23, 2014, at 5:13, Akira Li <4kir4.1i@gmail.com> wrote:
For the newline="\0" case to work, it should instead behave like the newline='' or newline='\n' case, i.e., no translation should take place, to avoid corrupting embedded "\n\r" characters.
The draft PEP discusses this. I think it would be more consistent to translate for \0, just like \r and \r\n.
I read the [draft]. No translation is a better choice here. Otherwise (at the very least) it breaks `find -print0` use case.
No it doesn't. The only reason it breaks your code is that you add newline='\0' to your stdout wrapper as well as your stdin wrapper. If you just passed '', it would not do anything. And this is exactly parallel with the existing case with, e.g., trying to pass through a classic-Mac file full of '\r'-delimited strings that might contain embedded '\n' characters that you don't want to translate.
I won't repeat it several times but as you've already found out newline='\0' for stdout (at the very least) can be useful for line_buffering=True behavior. ...
There is also line_buffering parameter. From the docs:
If line_buffering is True, flush() is implied when a call to write contains a newline character.
The way this is actually defined seems broken to me; IIRC (I'll check the code later) it flushes on any '\r', and on any translated '\n'. So, it's doing the wrong thing with '\r' in most modes, and with '\n' in '' mode on non-Unix systems. So my thought was, just leave it broken.
Yes. I've found at least one issue http://bugs.python.org/issue22069
But now that I think about it, the existing code can only flush excessively, never insufficiently, and that's probably a property worth preserving. So maybe there _is_ a reason to pass newline for output without translation after all. In other words, the parameter may actually conflate _four_ things, not just three...
I'll need to think this through (and reread the code) this weekend; thanks for bringing it up.
-- Akira
On Jul 25, 2014, at 19:13, Akira Li <4kir4.1i@gmail.com> wrote:
I've added a patch that demonstrates "no translation" for alternative newlines behavior http://bugs.python.org/issue1152248#msg224016
Having taken a better look at the line buffering code, I now agree with you that this is necessary; otherwise we'd have to make a much bigger change to the implementation (which I don't think we want). When I update the draft PEP I'll change that and add a rationale (this also makes the rationale for "no translation for binary files" and for "only readnl is exposed, not writenl" a lot simpler). I'll also change it in my C patch (which I hope to be able to clean up and upload this weekend).