Iterating non-newline-separated files should be easier
tl;dr: readline and friends should take an optional sep parameter (which also means adding an iterlines method).

Recently, I was trying to add -0 support to a command-line tool, which means that it reads filenames out of stdin and/or a text file with \0 separators instead of \n. This means that my code that looked like this:

```python
with open(path, encoding=sys.getfilesystemencoding()) as f:
    for filename in f:
        do_stuff(filename)
```

… turned into this (from memory, not the exact code):

```python
def resplit(chunks, sep):
    buf = b''
    for chunk in chunks:
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf

with open(path, 'rb') as f:
    chunks = iter(lambda: f.read(4096), b'')
    for line in resplit(chunks, b'\0'):
        filename = line.decode(sys.getfilesystemencoding())
        do_stuff(filename)
```

Besides being a lot more code (and involving things that a novice might have problems reading, like that two-argument iter), this also means that the file pointer is way ahead of the line that's just been iterated, I'm inefficiently buffering everything twice, etc.

The problem is that readline is hardcoded to look for b'\n' for binary files and the smart universal-newline machinery for text files; there's no way to reuse its machinery if you want to look for something different, and there's no way to access the internals that it uses if you want to reimplement it.

While it might be possible to fix the latter problems in some generic and flexible way, that doesn't seem all that useful; really, other than changing the way readline splits, I don't think anyone wants to hook anything else about file objects. (On the other hand, people might want to hook it in more complex ways—e.g., pass a separator function instead of a separator string? I'm probably reaching there…)

If I'm right, all that's needed is an extra sep=None keyword-only parameter to readline and friends (where None means the existing newline behavior), along with an iterlines method that's identical to __iter__ except that it has room for that new parameter.

One minor side problem: Sometimes you don't actually have a file, but some kind of file-like object. I realize that as of 3.1 or so, this is supposed to mean it actually is an io.BufferedIOBase or the like, but there are still plenty of third-party modules that just demand and/or provide "something with read(size)". In fact, that's the case with the problem I ran into above; another feature uses a third-party module to provide file-like objects for members of all kinds of uncommon archive types, and unlike zipfile, that module wasn't changed to provide io subclasses when it was ported to 3.x. So, it might be worth having adapters that make it easier (or just possible…) to wrap such a thing in the actual io interfaces. (The existing wrappers aren't adapters—BufferedReader demands readinto(buf), not read(size); TextIOWrapper can only wrap a BufferedIOBase.) But that's really a separate issue (and the answer to that one may just be to hold firm with "file-like object means IOBase" and eventually every library you care about will work that way, even if you occasionally have to fix it yourself).
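For concreteness, here is the resplit generator from the message above as self-contained, runnable code, with io.BytesIO standing in for the real file (the filenames and chunk size are just illustrative):

```python
import io

def resplit(chunks, sep):
    """Re-split an iterable of byte chunks on an arbitrary separator."""
    buf = b''
    for chunk in chunks:
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf

# Simulate reading a \0-separated file in small chunks, as the
# two-argument iter(f.read, b'') idiom in the post produces.
f = io.BytesIO(b'spam.txt\0eggs.txt\0ham.txt\0')
chunks = iter(lambda: f.read(4), b'')
print(list(resplit(chunks, b'\0')))  # [b'spam.txt', b'eggs.txt', b'ham.txt']
```

Note that records spanning chunk boundaries are handled by carrying the unterminated tail in buf, which is exactly the double-buffering the author complains about.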
I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?)

I don't think it is reasonable to add a new parameter to readline(), because streams are widely implemented using duck typing -- every implementation would have to be updated to support this.

On Thu, Jul 17, 2014 at 12:53 PM, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
-- --Guido van Rossum (python.org/~guido)
I don't think it is reasonable to add a new parameter to readline(), because streams are widely implemented using duck typing -- every implementation would have to be updated to support this.
Could the "split" (or splitline) keyword-only parameter instead be passed to the open function (and the __init__ of IOBase and be stored there)?
On Thursday, July 17, 2014 2:40 PM, Alexander Heger <python@2sn.net> wrote:
Could the "split" (or splitline) keyword-only parameter instead be passed to the open function (and the __init__ of IOBase and be stored there)?
Good idea. It's less powerful/flexible, but probably good enough for almost all use cases. (I can't think of any file where I'd need to split part of it on \0 and the rest on \n…) Also, it means you can stick with the normal __iter__ instead of needing a separate iterlines method.

And, since open/__init__/etc. isn't part of the protocol, it's perfectly fine for the builtin open, etc., to be an example or template that's generally worth following if there's no good reason not to, rather than a requirement that must be followed. So, if I'm getting file-like objects handed to me by some third-party library or plugin API or whatever, and I need them to be \0-separated, in many cases the problems with resplit won't be an issue, so I can just use it as a workaround; in the remaining cases, I can request that the library/app/whatever add the sep parameter to the next iteration of the API.

So, I retract my original suggestion in favor of this one. And, separately, Guido's idea of adding the helpers (or at least resplit, plus documentation on how to write the other stuff) to the stdlib somewhere.

Thanks.
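As a sketch of what "the separator lives on the file object" could look like, here is a minimal wrapper that works today. The class name RecordFile and its API are invented here for illustration; they are not part of any actual proposal. It assumes only that the wrapped object has a read(size) method:

```python
import io

class RecordFile:
    """Hypothetical sketch: the separator is fixed at construction time,
    and plain iteration honors it, so no iterlines method is needed."""

    def __init__(self, fileobj, sep=b'\n', chunk_size=4096):
        self._f = fileobj
        self._sep = sep
        self._chunk_size = chunk_size
        self._buf = b''

    def __iter__(self):
        while True:
            chunk = self._f.read(self._chunk_size)
            if not chunk:
                break
            self._buf += chunk
            parts = self._buf.split(self._sep)
            self._buf = parts.pop()  # keep the unterminated tail
            yield from parts
        if self._buf:
            yield self._buf
            self._buf = b''

f = RecordFile(io.BytesIO(b'one\0two\0three'), sep=b'\0')
print(list(f))  # [b'one', b'two', b'three']
```

This has the same file-pointer and double-buffering drawbacks as resplit; the point is only the shape of the API, with the separator chosen once at construction.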
On Thursday, July 17, 2014 3:21 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
On Thursday, July 17, 2014 2:40 PM, Alexander Heger <python@2sn.net> wrote:
Could the "split" (or splitline) keyword-only parameter instead be passed to the open function (and the __init__ of IOBase and be stored there)?
Good idea. It's less powerful/flexible, but probably good enough for almost all use cases. (I can't think of any file where I'd need to split part of it on \0 and the rest on \n…) Also, it means you can stick with the normal __iter__ instead of needing a separate iterlines method.
It turns out to be even simpler than I expected. I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.

For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it. (Of course you'd also want to add it to all of the stdlib cases like zipfile.ZipFile.open/zipfile.ZipExtFile.__init__, but there aren't too many of those.)

This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline. I think that's a good thing ('\r\n' and '\r' would need exceptions for backward compatibility; '\0'.encode('utf-16-le') isn't a very useful thing to split on; etc.), but doing it the other way is almost as easy, and very little code will ever care.
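The binary half of the change ("two lines in RawIOBase.readline") can be sketched in pure Python. This is not CPython's actual implementation, just a simplified standalone function showing the semantics being described: the unbuffered readline reads a byte at a time until it sees the terminator, and the only change is which byte it looks for (the sketch assumes a one-byte separator):

```python
import io

def readline_with_sep(raw, sep=b'\n'):
    """Read from a raw stream up to and including sep, or to EOF."""
    assert len(sep) == 1  # simplification; RawIOBase.readline is also byte-at-a-time
    line = bytearray()
    while True:
        b = raw.read(1)
        if not b:       # EOF
            break
        line += b
        if b == sep:    # the terminator check is the only line that changes
            break
    return bytes(line)

raw = io.BytesIO(b'foo\0bar\0')
print(readline_with_sep(raw, b'\0'))  # b'foo\x00'
print(readline_with_sep(raw, b'\0'))  # b'bar\x00'
```

As with the stock readline, the returned record keeps its terminator, and an empty bytes object signals EOF.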
On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
It turns out to be even simpler than I expected.
I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
All the words are in English, but I have no idea what you're actually saying... :-)

You seem to be talking about the implementation of the change, but what is the interface? Having made all these changes, how does it affect Python code? You have a use-case of splitting on something other than the standard newlines, so how does one do that? E.g. suppose I have a file "spam.txt" which uses NEL (Next Line, U+0085) as the end-of-line character. How would I iterate over lines in this file?
This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline.
I don't understand what you mean by this.

--
Steven
On Fri, Jul 18, 2014 at 1:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
The way I understand it is this:

```python
for line in open("spam.txt", newline="\u0085"):
    process(line)
```

If that's the case, I would be strongly in favour of this. Nice and clean, and should break nothing; there'll be special cases for newline=None and newline='', and the only change is that, instead of a small number of permitted values ('\n', '\r', '\r\n'), any string (or maybe any one-character string plus '\r\n'?) would be permitted.

Effectively, it's not "iterate over this file, divided by \0 instead of newlines", but it's "this file uses the unusual encoding of newline=\0, now iterate over lines in the file". Seems a smart way to do it IMO.

ChrisA
On Jul 17, 2014, at 20:36, Chris Angelico <rosuav@gmail.com> wrote:
Exactly. As soon as Alexander suggested it, I immediately knew it was much better than my original idea. (Apologies for overestimating the obviousness of that.)
Well, I had to look up the newline option for open(), even though I probably invented it. :-)

Would it still apply only to text files?

On Thursday, July 17, 2014, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
-- --Guido van Rossum (on iPad)
On Jul 17, 2014, at 21:47, Guido van Rossum <guido@python.org> wrote:
Well, I had to look up the newline option for open(), even though I probably invented it. :-)
While we're at it, I think most places in the documentation and docstrings that refer to the parameter, except open itself, call it newlines (e.g., io.IOBase.readline), and as far as I can tell it's been like that from day one, which shows just how much people pay attention to the current feature. :)
Would it still apply only to text files?
I think it makes sense to apply to binary files as well. Splitting binary files on \0 (or, for that matter, \r\n...) is probably at least as common a use case as text files. Obviously the special treatment for "" (which for text files is more a universal-newlines flag than a newline value) wouldn't carry over to b""; that might as well just be an error, although I suppose it could also mean to split on every byte, as with bytes.split. Also, I'm not sure whether the write behavior (replacing a terminal "\n" with the newline) should carry over from text to binary, or whether binary files need only the read behavior and newline should be ignored on write.
On Jul 17, 2014, at 20:21, Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Jul 17, 2014 at 05:04:00PM -0700, Andrew Barnert wrote:
It turns out to be even simpler than I expected.
I reused the "newline" parameter of open and TextIOWrapper.__init__, adding a param of the same name to the constructors for BufferedReader, BufferedWriter, BufferedRWPair, BufferedRandom, and FileIO.
For text files, just remove the check for newline being one of the standard values and it all works. For binary files, remove the check for truthy, make open pass each Buffered* constructor newline=(newline if binary else None), make each Buffered* class store it, and change two lines in RawIOBase.readline to use it. And that's it.
All the words are in English, but I have no idea what you're actually saying... :-)
You seem to be talking about the implementation of the change, but what is the interface?
"I reused the newline parameter." My mistake was assuming that was so simple, nothing else needed to be said. But that only works if everyone went back and completely read the previous suggestions, which I realize nobody had any good reason to do.

Basically, the only change to the API is that it's no longer an error to pass arbitrary strings (or bytes, for binary mode) for newline. The rules for how "\0" is handled are identical to the rules for "\r". There's almost nothing else to explain, but not quite; so, like an idiot, I dove into the minor nits in detail, skipping over the main point.
Having made all these changes, how does it affect Python code?
Existing legal code does not change at all. Some code that used to be an error now does something useful (see below).
You have a use-case of splitting on something other than the standard newlines, so how does one do that? E.g. suppose I have a file "spam.txt" which uses NEL (Next Line, U+0085) as the end of line character. How would I iterate over lines in this file?
```python
with open("spam.txt", newline="\u0085") as f:
    for line in f:
        process(line)
```
This means that the buffer underlying a text file with a non-standard newline doesn't automatically have a matching newline.
I don't understand what you mean by this.
If you write this:

```python
with open("spam.txt", newline="\u0085") as f:
    for line in f.buffer:
        ...
```

… the bytes you get back will be split on b"\n", not on "\u0085".encode(locale.getpreferredencoding()). The newline applies only to the text file, not to its underlying binary buffer. (This is exactly the same as the current behavior: if you open a file with newline='\r' in 3.4 and then iterate f.buffer, it's still going to split on b'\n', not b'\r'.)
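The current behavior being described here can be checked directly with in-memory streams, no patched interpreter required:

```python
import io

data = b'one\rtwo\rthree\n'

# Text layer: newline='\r' is already legal today, and makes line
# iteration split on '\r' (with no translation of the line endings).
text = io.TextIOWrapper(io.BytesIO(data), encoding='ascii', newline='\r')
print(list(text))  # ['one\r', 'two\r', 'three\n']

# Binary layer: readline/iteration is hardwired to b'\n' regardless
# of what the text layer above it was told.
print(list(io.BufferedReader(io.BytesIO(data))))  # [b'one\rtwo\rthree\n']
```

So the text file and its buffer already disagree about what a "line" is whenever a non-default newline is in use; the proposal just widens the set of values for which that is true.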
On 07/18/2014 02:04 AM, Andrew Barnert wrote:
You are not the first one to come up with this idea and suggest solutions. This whole thing has been hanging around on the bug tracker as an unresolved issue (started by Nick Coghlan) for almost a decade:

http://bugs.python.org/issue1152248

Ever since discovering it, I've been sticking to the recipe provided by Douglas Alan:

http://bugs.python.org/issue1152248#msg109117

Not that I wouldn't like to see this feature shipping with Python, but it may help to read through all aspects of the problem that have been discussed before.

Best,
Wolfgang
Before responding to Wolfgang, something that occurred to me overnight: The only insurmountable problem with Guido's suggestion of "just unwrap and rewrap the raw or buffer in a subclass that adds this behavior" is that you can't write such a subclass of TextIOWrapper, because it has no way to either peek at or push back onto the buffer. So... why not add one of those?

Pushing back is easier to implement (since it's already there as a private method), but a bit funky; peeking would mean it works the same way as with buffered binary files. But I'll take a look at the idiomatic way to do similar things in other languages (C stdio, C++ iostreams, etc.), and make sure that peek is actually sensible for TextIOWrapper, before arguing for it.

While we're at it, it might be nice for the peek method to be documented as an (optional, like raw, etc.?) member of the two ABCs, instead of just something that one implementation happens to have and that the mixin code will use if it happens to be present. (Binary readline uses peek if it exists, and falls back to byte-by-byte reads if not.)

On Jul 18, 2014, at 4:53, Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> wrote:
Thanks. Douglas's recipe is effectively the same as my resplit, except less general (since it consumes a file rather than any iterable), and some, but not all, of the limitations of that approach were mentioned. And R. David Murray's hack patch is basically the same as the text half of my patch.

The discussion there is also useful, as it surveys the similar features in perl, awk, bash, etc., all of which work by having the user change either a global or something on the file object, rather than putting it in the line-reading code. That reinforces my belief that Alexander's idea of putting the separator value in the file constructors was right, and that my initially putting it in readline or a new readuntil method was wrong.
On 18 July 2014 12:43, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
Slight tangent, but this rewrapping question also arises in the context of changing encodings on an already open stream. See http://bugs.python.org/issue15216 for (the gory) details.
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line-based iteration methods. Lines and arbitrary records *aren't* the same thing, and I don't think we'd be doing anybody any favours by conflating them (whether at the method level or at the constructor-argument level).

While, as an implementation artifact, it may be possible to get this "easily" by abusing the existing newline parameter, that's likely to break a lot of assumptions in *other* code that specifically expects newlines to refer to actual line endings. A new, separate method cleanly isolates the feature to code that wants to use it, preventing potentially adverse and hard-to-debug impacts on unrelated code that happens to receive a file object with a custom record separator configured.

With this kind of proposal, it isn't the "what happens when it works?" cases that worry me; it's the cases where it *fails* and someone is stuck with figuring out what has gone wrong. A new method fails cleanly, but changing the semantics of *existing* arguments, attributes and methods? That doesn't fail cleanly at all, and can also have far-reaching impacts on the correctness of all sorts of documentation.

Attempting to wedge this functionality into *existing* constructs means *changing* a lot of expectations that are now well established in a Python context. By contrast, adding a *new* construct, specifically for this purpose, means nothing needs to change with existing constructs, we don't inadvertently introduce even more obscure corner cases in newline handling, and there's a solid terminology hook to hang the documentation on (iteration by line vs iteration by record; we can also be clear that "line buffered" really does correspond to iteration by line, and may not be available for arbitrary record separators).
Providing this feature as a separate method also makes it possible for the IO ABCs to provide a default implementation (along the lines of your resplit function), that concrete implementations can optionally override with something more optimised. Pure duck-typed cases (not inheriting from the ABCs) will fail with a fairly obvious error ("AttributeError: 'MyCustomFileType' object has no attribute 'readrecords'" rather than something related to unknown parameter names or illegal argument values), while those that do inherit from the ABCs will "just work". Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
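[Editor's note: as a rough illustration, an ABC-level default along those lines might look like the sketch below. `readrecords` is the method name being proposed in the thread, not an existing API; the helper is modelled on the resplit function from the original message.]

```python
import io

def readrecords(f, sep=b'\0', chunksize=4096):
    """Sketch of a possible ABC-level default: split a binary stream
    into records, yielding each record without its trailing *sep*."""
    buf = b''
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        parts = (buf + chunk).split(sep)
        buf = parts.pop()      # possibly-incomplete final record
        yield from parts
    if buf:
        yield buf              # trailing record with no separator
```

A concrete implementation could then override this with something that shares the object's own internal buffer instead of buffering twice.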
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0. ChrisA
On 19 July 2014 03:32, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Python isn't Unix, and Python has never supported \0 as a "line ending". Changing the meaning of existing constructs is fraught with complexity, and should only be done when there is absolutely no alternative. In this case, there's an alternative: a new method, specifically for reading arbitrary records. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote:
On 19 July 2014 03:32, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Python isn't Unix, and Python has never supported \0 as a "line ending". Changing the meaning of existing constructs is fraught with complexity, and should only be done when there is absolutely no alternative. In this case, there's an alternative: a new method, specifically for reading arbitrary records.
I don't have an opinion one way or the other, but I don't quite see why you're worried about allowing the newline parameter to be set to some arbitrary separator. The best I can come up with is a scenario something like this: I open a file with some record separator

    fp = open(filename, newline="\0")

then pass it to a function, spam(fp), which assumes that each chunk ends with a linefeed:

    assert next(fp).endswith('\n')

But in a case like that, the function is already buggy. I can see at least two problems with such an assumption:

- what if universal newlines has been turned off and you're reading a file created under (e.g.) classic Mac OS or RISC OS?

- what if the file contains a single line which does not end with an end of line character at all?

    open('/tmp/junk', 'wb').write(b"hello world!")
    next(open('/tmp/junk', 'r'))

Have I missed something? Although I don't mind whether files grow a readrecords() method, or re-use the readlines() method, I'm not convinced that API decisions should be driven solely by the needs of programs which are already buggy. -- Steven
On 19 July 2014 05:01, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jul 19, 2014 at 04:18:35AM -0400, Nick Coghlan wrote: But in a case like that, the function is already buggy. I can see at least two problems with such an assumption:
- what if universal newlines has been turned off and you're reading a file created under (e.g.) classic Mac OS or RISC OS?
That's exactly the point though - people *do* assume "\n", and we've gone to great lengths to make that assumption *more correct* (even though it's still wrong sometimes). We can't reverse course on that, and expect the outcome to make sense to *people*. When code making use of a configurable line-ending feature breaks (and it will), its users are going to be confused, and the docs likely aren't going to help much.
- what if the file contains a single line which does not end with an end of line character at all?
open('/tmp/junk', 'wb').write(b"hello world!") next(open('/tmp/junk', 'r'))
Have I missed something?
Although I'm don't mind whether files grow a readrecords() method, or re-use the readlines() method, I'm not convinced that API decisions should be driven solely by the needs of programs which are already buggy.
It's not being driven by the needs of programs that are already buggy - my preferences are driven by the fact that line endings and record separators are *not the same thing*. Thinking that they are is a matter of confusing the conceptual data model with the implementation of the framing at the serialisation layer.

If we *do* try to treat them as the same thing, then we have to go find *every single reference* to line endings in the documentation and add a caveat about it being configurable at file object creation time, so it might actually be based on something completely arbitrary. Line endings are *already* confusing enough that the "universal newlines" mechanism was added to make it so that Python level code could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and just assume "\n" everywhere.

This is why I'm a fan of keeping things comparatively simple, and just adding a new method (if we only add an iterator version) or two (if we add a list version as well) specifically for this use case. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 19 July 2014 10:01, Steven D'Aprano <steve@pearwood.info> wrote:
I open a file with some record-separator
fp = open(filename, newline="\0")
then pass it to a function:
spam(fp)
which assumes that each chunk ends with a linefeed:
assert next(fp).endswith('\n')
I will often do

    for line in fp:
        line = line.strip()

to remove the line ending ("record separator"). This fails if you have an arbitrary separator. And for that matter, how would you remove an arbitrary separator? Maybe line = line[:-1] works, but what if at some point people ask for multi-character separators ("\n\n" for "paragraph separated", for example - ignoring the universal newline complexities in that). A splitrecord method still needs a means for code to remove the record separator, of course, but the above demonstrates how reusing line separation could break the assumptions of *current* code. Paul
Paul Moore wrote:
And for that matter, how would you remove an arbitrary separator? Maybe line = line[:-1] works, but what if at some point people ask for multi-character separators
If the newline mechanism is re-used, it would convert whatever separator is used into '\n'. -- Greg
On Saturday, July 19, 2014 9:42 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Paul Moore wrote: And for that matter, how would you remove an arbitrary separator? Maybe line = line[:-1] works, but what if at some point people ask for multi-character separators
You already can't use line[:-1] today, because '\r\n' is already a valid value, and always has been. And however people deal with newline='\r\n' will work for any crazy separator you can think of. Maybe line[:-len(nl)]. Maybe line.rstrip(nl) if it's appropriate (it isn't always, either for \r\n or for some arbitrary separator).
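[Editor's note: the pitfall being discussed is easy to make concrete with nothing but stdlib string behaviour:]

```python
sep = '\r\n'
line = 'data\r\n'

# Slicing off exactly one separator is safe, whatever its length:
assert line[:-len(sep)] == 'data'

# rstrip() treats its argument as a *set* of characters, so it can
# strip more than the one trailing separator:
assert 'para\n\n'.rstrip('\n') == 'para'      # both newlines removed
assert 'odd\r\r\n'.rstrip('\r\n') == 'odd'    # the stray '\r' removed too
```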
If the newline mechanism is re-used, it would
convert whatever separator is used into '\n'.
No it wouldn't. https://docs.python.org/3/library/io.html#io.TextIOWrapper
When reading input from the stream, if newline is None, universal newlines mode is enabled… If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
So, making '\0' a legal value just means the '\0' line endings will be returned to the caller untranslated. Also, remember that binary files don't do universal newline translation ever, so just letting you change the separator there wouldn't add translation. Of course both of those could be changed as well (although with what interface, I'm not sure…), but I don't think they should be.
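[Editor's note: the documented newline='\r' behaviour referred to here can be observed directly with the standard io module:]

```python
import io

# With newline='\r', input lines are terminated only by '\r', and the
# ending is returned to the caller untranslated.
f = io.TextIOWrapper(io.BytesIO(b'alpha\rbeta\rgamma'),
                     encoding='ascii', newline='\r')
assert list(f) == ['alpha\r', 'beta\r', 'gamma']
```

The proposal amounts to letting '\0' behave the same way as '\r' does here.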
If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome. -- --Guido van Rossum (python.org/~guido)
Per Nick's suggestion, I will write up a draft PEP, and link it to issue #1152248, which should be a lot easier to follow. If you want to wait until the first round of discussion and the corresponding update to the PEP before checking in, I'll make sure it's obvious when that's happened. Sent from a random iPhone On Jul 19, 2014, at 22:45, Guido van Rossum <guido@python.org> wrote:
If and when something is decided in this thread, can someone summarize it to me? I don't have time to read all the lengthy arguments but I do care about the outcome.
-- --Guido van Rossum (python.org/~guido)
Le 19/07/2014 05:01, Steven D'Aprano a écrit :
I open a file with some record-separator
fp = open(filename, newline="\0")
Hmm... newline="\0" already *looks* wrong. To me, it's a hint that you're abusing the API. The main advantage of it, though, is that you can use iteration in addition to the regular readline() (or readrecord()) method. Regards Antoine.
On 2014-07-19 10:01, Steven D'Aprano wrote: [snip]
- what if universal newlines has been turned off and you're reading a file created under (e.g.) classic Mac OS or RISC OS?
[snip] FTR, the line ending in RISC OS is '\n'.
I don't have time for this thread.

I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).

I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).

I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-).

I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character.

I value the equivalence of __next__() and readline().

I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data). Once a suitable wrapper class has been implemented as a 3rd party module and is in common use you may petition to have it added to the standard library, as a separate module/class/function. -- --Guido van Rossum (python.org/~guido)
On 19 Jul 2014, at 22:05, Guido van Rossum <guido@python.org> wrote:
I don't have time for this thread.
I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).
I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful. One of them was even mentioned in this discussion: processing the output of find -0. Wichert.
On Jul 20, 2014, at 0:50, Wichert Akkerman <wichert@wiggy.net> wrote:
On 19 Jul 2014, at 22:05, Guido van Rossum <guido@python.org> wrote:
I don't have time for this thread.
I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).
I see another problem with doing this by modifying the open() call: it does not work for filehandles created using other methods such as pipe() or socket(), either used directly or via subprocess. There are real-world examples of situations where that is very useful.
A socket() is not a Python file object, doesn't have a similar API, and doesn't have a readline method. The result of calling socket.makefile, on the other hand, is a file object--and it's created by calling open.* And I'm pretty sure socket.makefile already takes a newline argument and just passes it along, in which case it will magically work with no changes at all.**

IIRC, os.pipe() just returns a pair of fds (integers), not a file object at all. It's up to you to wrap that in a file object if you want to--which you do by passing it to the open function. So, neither of your objections works.

There are some better examples you could have raised, however. For example, a bz2.BzipFile is created with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. So, it would have to be changed to get the benefit.

However, given that there's no way to magically make every file-like object anyone has ever written automatically grow this new functionality, having the API change on the constructors, which are not part of any API and not consistent, is better than having it on the readline method. Think about where you'd get the error in each case: before even writing your code, when you look up how BzipFile instances are created and see there's no way to pass a newline argument, or deep in your code, when you're using a file object that came from who knows where and its readline method doesn't like the standard, documented newline argument?

* Or maybe it's created by constructing a BufferedReader, BufferedWriter, BufferedRandom, or TextIOWrapper directly. I don't remember off hand. But it doesn't matter, because the suggestion is to put the new parameter in those constructors, and make open forward to them, so whether makefile calls them directly or via open, it gets the same effect.

** Unless it validates the arguments before passing them along.
I looked over a few stdlib classes, and there was at least one that unnecessarily does the same validation open is going to do anyway, so obviously that needs to be removed before the class magically benefits. In some cases (like tempfile.NamedTemporaryFile), even that isn't necessary, because the implementation just passes through all **kwargs that it doesn't want to handle to the open or constructor call.
On 20 July 2014 12:53, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
There are some better examples you could have raised, however. For example, a bz2.BzipFile is created with bz2.open. And, while the file delegates to a BufferedReader or TextIOWrapper, bz2.open almost certainly validates its inputs and won't pass newline on to the BufferedReader in binary mode. So, it would have to be changed to get the benefit.
The most significant example is one which has been mentioned, but you may have missed. The motivation for this proposal is to interoperate with the -0 flag on things like the unix find command. But that is typically used in a pipe, which means your Python program will likely receive \0 terminated records via sys.stdin. And sys.stdin is already opened for you - you do not have the option to specify a newline argument. In actual fact, I can't think of a good example (either from my own experience, or mentioned in this thread) where I'd expect to be reading \0-terminated records from anything *except* sys.stdin. Paul
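[Editor's note: for context, the blunt workaround available today for a `find ... -print0 | script.py` pipeline is to bypass the text layer and split sys.stdin.buffer by hand. The helper name below is ours, not a stdlib API:]

```python
import io
import sys

def filenames_from_stream(stream, encoding=None):
    # Read the whole binary stream and split on NUL; fine for moderate
    # amounts of data, but offers no incremental iteration.
    encoding = encoding or sys.getfilesystemencoding()
    return [part.decode(encoding)
            for part in stream.read().split(b'\0') if part]

# In a real script: filenames_from_stream(sys.stdin.buffer)
assert filenames_from_stream(io.BytesIO(b'a.txt\0b c.txt\0'),
                             encoding='utf-8') == ['a.txt', 'b c.txt']
```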
-- Clint
On Jul 20, 2014, at 9:42 AM, Paul Moore <p.f.moore@gmail.com> wrote:
In actual fact, I can't think of a good example (either from my own experience, or mentioned in this thread) where I'd expect to be reading \0-terminated records from anything *except* sys.stdin.
Named pipes and whatever is used to implement process substitution ( < <(find ... -0) ) come to mind.
On Sat, Jul 19, 2014 at 3:48 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Python isn't Unix, and Python has never supported \0 as a "line ending". Changing the meaning of existing constructs is fraught with complexity, and should only be done when there is absolutely no alternative. In this case, there's an alternative: a new method, specifically for reading arbitrary records.
"practicality beats purity." http://legacy.python.org/dev/peps/pep-0020/ -- Juancarlo *Añez*
(replies to multiple messages here) On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 19 July 2014 03:32, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Python isn't Unix, and Python has never supported \0 as a "line ending".
Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools. For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script. In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
Changing the meaning of existing constructs is fraught with complexity, and should only be done when there is absolutely no alternative. In this case, there's an alternative: a new method, specifically for reading arbitrary records.
This was basically my original suggestion, so obviously I don't think it's a terrible idea. But I don't think it's as good. First, which of these is more readable, easier for novices to figure out how to write, etc.:

    with open(path, newline='\0') as f:
        for line in f:
            handle(line.rstrip('\0'))

    with open(path) as f:
        for line in iter(lambda: f.readrecord('\0'), ''):
            handle(line.rstrip('\0'))

Second, as Guido mentioned at the start of this thread, existing file-like object types (whether they implement BufferedIOBase or TextIOBase, or just duck-type the interfaces) are not going to have the new functionality. Construction has never been part of the interface of the file-like object API; opening a real file has always looked different from opening a member file in a zip archive or making a file-like wrapper around a socket transport or whatever. But using the resulting object has always been the same. Adding a readrecord method or changing the interface of readline means that's no longer true.

There might be a good argument for making the change more visible—that is, using a different parameter on the open call instead of reusing the existing newline. (And that's what Alexander originally suggested as an alternative to my readrecord idea.) That way, it's much more obvious that spam.open or eggs.makefile or whatever doesn't support alternate line endings, without having to read its documentation on what newline means. But either way, I think it should go in the open function, not the file-object API. On Saturday, July 19, 2014 2:28 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
- my preferences are driven by the fact that line endings and record separators are *not the same thing*. Thinking that they are is a matter of confusing the conceptual data model with the implementation of the framing at the serialisation layer.
Yes, using lines implicitly as records can lead to confusion—but people actually do that all the time; this isn't a new problem, and it's exactly the same problem with \r\n, or even \n, as with \0. When you open up TextEdit and write a grocery list with one item on each line, those newlines are not part of the items. When you pipe the output of find to a script, the newlines are not part of the filenames. When you pipe the output of find -0 to a script, the \0 terminators are not part of the filenames.
Line endings are *already* confusing enough that the "universal newlines" mechanism was added to make it so that Python level code could mostly ignore the whole "\n" vs "\r" vs "\r\n" distinction, and just assume "\n" everywhere.
I understand the point here. There are cases where universal newlines let you successfully ignore the confusion rather than dealing with it, and newline='\0' will not be useful in those cases. But then newline='\r' is also never useful in those cases. The new behavior will be useful in exactly the cases where '\r' already is—no more, but no less.
This is why I'm a fan of keeping things comparatively simple, and just adding a new method (if we only add an iterator version) or two (if we add a list version as well) specifically for this use case.
Actually, the obvious new method is neither the iterator version nor the list version, but a single-record version, readrecord. Sometimes you need readline/readrecord, and it's conceptually simpler for the user. And of course the implementation is a lot simpler; you don't need to build a new iterator object that references the file for readrecord the way you do for iterrecords. And finally, if you only have one of the two, as bad as iter(lambda: f.readrecord('\0'), '') may look to novices, next(f.iterrecords('\0')) would probably be even more confusing. But we could also add an iterrecords, for two methods. And as for the list-based version… well, I don't even understand why readlines still exists in 3.x (much less why the tutorial suggests it), so I'd be fine not having a readrecords, but I don't have any real objection. On Saturday, July 19, 2014 1:06 PM, Guido van Rossum <guido@python.org> wrote:
I never meant to suggest anything that would require pushing back data into the buffer (you must have misread me).
I get the feeling either there's a much simpler way to wrap a file object that I'm missing, or that you think there is. In order to do the equivalent of readrecord, you have to do one of three things:

1. Read character by character, which can be incredibly slow.

2. Peek or push back on the buffer, as the io classes' readline methods do.

3. Put another buffer in front of the file, which means you have two objects both sharing the same file but with effective file pointers out of sync. And you have to reproduce all of the file-like-object API methods for your new buffered object (a lot more work, and a lot more to get wrong—effectively, it means you have to write all of BufferedReader or TextIOWrapper, but modified to wrap another buffered file instead of wrapping the lower-level thing).

And no matter how you do it, it's obviously going to be less efficient. If there's a lighter version of #3 that makes sense, I'm not seeing it. Which is probably a problem with my lack of insight, but I'd appreciate a pointer in the right direction.
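[Editor's note: for what it's worth, option 2 is roughly this much code on top of BufferedReader's existing peek(). This is a sketch assuming a single-byte separator; a multi-byte separator could straddle the peeked window:]

```python
import io

def read_record(f, sep=b'\0'):
    # f is a BufferedReader; peek() exposes its buffer without adding a
    # second buffering layer, so the file position stays exactly at the
    # end of the record just read.
    chunks = []
    while True:
        window = f.peek(1)           # returns b'' only at EOF
        if not window:
            break
        i = window.find(sep)
        if i >= 0:
            chunks.append(f.read(i + len(sep)))
            break
        chunks.append(f.read(len(window)))
    return b''.join(chunks)

f = io.BufferedReader(io.BytesIO(b'one\0two\0three'))
assert read_record(f) == b'one\0'
assert read_record(f) == b'two\0'
assert f.tell() == 8                 # position is in sync with iteration
```

The catch, as noted in the thread, is that this only works on objects that actually provide peek(), which duck-typed file-like objects generally don't.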
I don't like changing the meaning of the newline argument to open (and it doesn't solve enough use cases any way).
Maybe using a different argument is a better answer. (That's what Alexander suggested originally.) The reason both I and people on the bug thread suggested using newline instead is because the behavior you want from sep='\0' happens to be identical to the behavior you get from newline='\r', except with '\0' instead of '\r'. And that's the best argument I have for reusing newline: someone has already worked out and documented all the implications of newline, and people have already learned them, so if we really want the same functionality, it makes sense to reuse it. But I realize that argument only goes so far. It wasn't obvious, until I looked into it, that I wanted the exact same functionality.
I personally think it's preposterous to use \0 as a separator for text files (nothing screams binary data like a null byte :-).
Sure, it would have been a lot better for find and friends to grow a --escape parameter instead of -0, but I think that ship has sailed.
I don't think it's a big deal if a method named readline() returns a record that doesn't end in a \n character.
I value the equivalence of __next__() and readline().
I still think you should solve this using a wrapper class (that does its own buffering if necessary, and implements the rest of the stream protocol for the benefit of other consumers of some of the data).
Again, I don't see any way to do this sensibly that wouldn't be a whole lot more work than just forking the io package. But maybe that's the answer: I can write _io2 as a fork of _io with my changes, the same for _pyio2 (for PyPy), and then the only thing left to write is a __main__ for the package that wraps up _io2/_pyio2 in the io ABCs (and re-exports those ABCs).
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert@yahoo.com> wrote:
(replies to multiple messages here)
On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan@gmail.com>
wrote:
On 19 July 2014 03:32, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com>
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Python isn't Unix, and Python has never supported \0 as a "line ending".
Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools.
For the record, the reason this came up is that someone was trying to use
one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script.
In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.

I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences. Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode. Cheers, Nick.
On Sun, Jul 20, 2014 at 9:49 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.
U+0000 is a valid Unicode character, so I'd have no objection to, for instance, splitting a UTF-8 encoded text file on \0. ChrisA
On 20 Jul 2014 09:49, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert@yahoo.com> wrote:
(replies to multiple messages here)
On Saturday, July 19, 2014 1:19 AM, Nick Coghlan <ncoghlan@gmail.com>
On 19 July 2014 03:32, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Jul 19, 2014 at 5:10 PM, Nick Coghlan <ncoghlan@gmail.com>
wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Python isn't Unix, and Python has never supported \0 as a "line ending".
Well, yeah, but Python is used on Unix, and it's used to write scripts that interoperate with other Unix command-line tools.

For the record, the reason this came up is that someone was trying to use one of my scripts in a pipeline with find -0, and he had no problem adapting the Perl scripts he's using to handle -0 output, but no clue how to do the same with my Python script.

In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
I would find adding NULL to the potential newline set significantly less
objectionable than opening it up to arbitrary character sequences.
Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.

Also, the interoperability argument is a good one, as is the analogy with '\r'. Since this does end up touching the open() builtin and the core IO abstractions, it will need a PEP. As far as implementation goes, I suspect a RecordIOWrapper layered IO model inspired by the approach used for TextIOWrapper may make sense. Cheers, Nick.
On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert@yahoo.com> wrote:
In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.
Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.
But newline is only permitted for text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and \0, while on text files the list of allowed values stays the same as today? Also, would you want the same semantics for newline='\0' on binary files that newline='\r' has on text files (including newline remapping on write)?

And I'm still not sure why you think this shouldn't be allowed in text mode in the first place (especially given that you suggested the same thing for text files _only_ a few years ago). The output of find is a list of newline-separated or \0-separated filenames, in the filesystem's encoding. Why should I be able to handle the first as a text file, but have to handle the second as a binary file and then manually decode each line?

You could argue that find -0 isn't really separating Unicode filenames with U+0000, but separating UTF-8 or Latin-1 or whatever filenames with \x00, and it's just a coincidence that they happen to match up. But it really isn't just a coincidence; it was an intentional design decision for Unicode (and UTF-8, and Latin-1) that the ASCII control characters map in the obvious way, and one that many tools and scripts take advantage of, so why shouldn't tools and scripts written in Python be able to take advantage of it?
On 20 July 2014 10:57, Andrew Barnert <abarnert@yahoo.com> wrote:
On Saturday, July 19, 2014 4:49 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 Jul 2014 09:28, "Andrew Barnert" <abarnert@yahoo.com> wrote:
In general, it's just as easy to write Unix command-line tools in Python as in Perl, and that's a good thing—it means I don't have to use Perl. But as soon as -0 comes into the mix, that's no longer true. And that's a problem.
I would find adding NULL to the potential newline set significantly less objectionable than opening it up to arbitrary character sequences.
Adding a single possible newline character is a much simpler change, and one likely to have far fewer odd consequences. This is especially so if specifying NULL as the line separator is only permitted for files opened in binary mode.
But newline is only permitted for text mode. Are you suggesting that we add newline to binary mode, but the only allowed values are None (current behavior) and b'\0', while on text files the list of allowed values stays the same as today?
Actually, I temporarily forgot that newline was only handled at the TextIOWrapper layer. All the more reason for a PEP that clearly lays out the status quo (both Python's own newline handling and the "-0" option for various UNIX utilities, and the way that is handled in other scripting languages), and discusses the various options for dealing with it (a new RecordIOWrapper class with a new "open" parameter, new methods on IO classes, new semantics on the existing TextIOWrapper class). If the description of the use cases is clear enough, then the "right answer" amongst the presented alternatives (which include "don't change anything") may be obvious. At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python.
That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep. ChrisA
On 20 July 2014 11:31, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python.
That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep.
Yes, but having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer. Hence why I think the PEP needs to explain why the UNIX utilities considered this use case sufficiently non-obscure to add explicit support for it, rather than just assuming that the obviousness of the use case can be taken for granted. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Saturday, July 19, 2014 6:42 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 July 2014 11:31, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python.
That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep.
Yes, but having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer. Hence why I think the PEP needs to explain why the UNIX utilities considered this use case sufficiently non-obscure to add explicit support for it, rather than just assuming that the obviousness of the use case can be taken for granted.
First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length. Second, "fix the filenames" is almost _never_ a better answer. If you're publishing a program for other people to use, you want to document that it won't work on some perfectly good files, and close their bugs as "Not a bug, rename your files if you want to use my software"? If the files are on a read-only filesystem or a slow tape backup, you really want to copy the entire filesystem over just so you can run a script on it? Also, even if "fix the filenames" were the right answer, you need to write a tool to do that, and why shouldn't it be possible to use Python for that tool? (In fact, one of the scripts I wanted this feature for is a replacement for the traditional rename tool (http://plasmasturm.org/code/rename/). I mainly wanted to let people use regular expressions without letting them run arbitrary Perl code, as rename -e does, but also, I couldn't figure out how to rename "foo" to "Foo" on a case-preserving-but-insensitive filesystem in Perl, and I know how to do it in Python.) At any rate, there are decades of tradition behind using -print0, and that's not going to change just because Python isn't as good as other languages at dealing with it. The GNU find documentation (http://linux.die.net/man/1/find) explicitly recommends, in multiple places, using -print0 instead of -print whenever possible. 
(For example, in the summary near the top, "If no expression is given, the expression -print is used (but you should probably consider using -print0 instead, anyway).") And part of the reason for that is that many other tools, like xargs, split on any whitespace, not on newlines, if not given the -0 argument. Fortunately, all of those tools know how to handle backslash escapes, but unfortunately, find doesn't know how to emit them. (Actually, frustratingly, both BSD and SysV find have the code to do it, but not in a way you can use here.) So, if you're writing a script that uses find and might get piped to anything that handles input like xargs, you have to use -print0. And that means, if you're writing a tool that might get find piped to it, you have to handle -print0, even if you're pretty sure nobody will ever have newlines for you to deal with, because they're probably going to want to use -print0 anyway, rather than figure out how your tool deals with other whitespace.
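[For contrast, here is a minimal sketch of what consuming -print0 output from Python looks like today, without any readline support for alternate separators; the helper name is mine, for illustration only:

import io
import os

def iter_null_separated(stream):
    """Yield decoded filenames from a -print0 style byte stream.

    Roughly what you have to write today: slurp the whole stream and
    split on b'\0'.  Fine for filename lists, but it gives up the
    incremental reading that readline-style support would offer.
    """
    for chunk in stream.read().split(b'\0'):
        if chunk:
            yield os.fsdecode(chunk)

# In a real filter this would be sys.stdin.buffer; BytesIO stands in here.
names = list(iter_null_separated(io.BytesIO(b'a b\0c\nd\0')))

Note that it copes with both a space and an embedded newline in the names, which is exactly what -print0 exists to guarantee.]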
On 20 July 2014 13:58, Andrew Barnert <abarnert@yahoo.com> wrote:
First, why is it so odd to have newlines in filenames? It used to be pretty common on Classic Mac. Sure, they're not too common nowadays, but that's because they're illegal on DOS/Windows, and because the shell on Unix systems makes them a pain to deal with, not because there's something inherently nonsensical about the idea, any more than filenames with spaces or non-ASCII characters or >255 length.
You answered your own question: because DOS/Windows make them illegal, and the Unix shell isn't fond of them either. I was a DOS/Windows user for more than a decade before switching to Linux for personal use, and in a decade of using Linux (and even going to work for a Linux vendor), I've never encountered a filename with a newline in it. Thus the idea that anyone *would* do such a thing, and that it would be prevalent enough for UNIX tools to include a workaround in programs that normally produce newline separated output is an entirely novel concept for me. Any such file I encountered *would* be an outlier, and I'd likely be in a position to get the offending filename fixed rather than changing any data processing pipelines (whether written in Python or not) to tolerate newlines in filenames (since the cost differential between fixing one filename vs updating the data processing pipelines would be enormous). However, note that my attitude changed significantly once you clarified the use case - it's clear that there *is* a use case, it's just one that's outside my own personal experience. That's one of the things the PEP process is for - to explain such use cases to folks that haven't personally encountered them, and then explain why the proposed solution addresses the use case in a way that makes sense for the domains where the use case arises. The recent matrix multiplication PEP was an exemplary example of the breed. That's what I'm asking for here: a PEP that makes sense to someone like me for whom the idea of putting a newline in a filename is completely alien. Yes, it's technically permitted by the underlying operating system APIs on POSIX systems, but all the affordances at both the console and GUI level suggest "no newlines allowed". 
If you're coming from a DOS/Windows background (as I did), then the idea that a newline is technically a permitted filename character may never even occur to you (it certainly hadn't to me, and I'd never previously come across anything to challenge that assumption). Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Saturday, July 19, 2014 10:00 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That's one of the things the PEP process is for - to explain such use cases to folks that haven't personally encountered them, and then explain why the proposed solution addresses the use case in a way that makes sense for the domains where the use case arises.
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt It's probably a lot more detailed than necessary in many areas, but I figured it was better to include too much than to leave things ambiguous; after I know which parts are not contentious, I can strip it down in the next revision. Meanwhile, while writing it, and re-reading Guido's replies in this thread, I decided to come back to the alternative idea of exposing text files' buffers just like binary files' buffers. If done properly, that would make it much easier (still not trivial, but much easier) for users to just implement the readrecord functionality on their own, or for someone to package it up on PyPI. And I don't think the idea is as radical as it sounded at first, so I don't want it to be dismissed out of hand. So, also see http://bugs.python.org/file36009/pep-peek.txt Finally, writing this up made me recognize a couple of minor problems with the patch I'd been writing, and I don't think I have time to clean it up and write relevant tests now, so I might not be able to upload a useful patch until next weekend. Hopefully people can still discuss the PEP without a patch to play with.
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.

Here's the sort of thing I mean, written for newline-separated files:

import sys

def process(filename):
    """Trivial example"""
    return filename.lower()

if __name__ == '__main__':
    for filename in sys.stdin:
        filename = process(filename)
        print(filename)

This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.

Paul
Paul Moore <p.f.moore@gmail.com> writes:
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.
Here's the sort of thing I mean, written for newline-separated files:
import sys
def process(filename):
    """Trivial example"""
    return filename.lower()
if __name__ == '__main__':
for filename in sys.stdin:
    filename = process(filename)
    print(filename)
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
`find -print0 | ./tr-filename -0 | xargs -0` example implies that you can replace `sys.std*` streams without worrying about preserving `sys.__std*__` streams:

#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path

def transform_filename(filename: str) -> str:  # example
    """Normalize whitespace in basename."""
    path = Path(filename)
    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
    path.replace(new_path)  # rename on disk if necessary
    return str(new_path)

def SystemTextStream(bytes_stream, **kwargs):
    encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(bytes_stream,
        encoding=encoding,
        errors='surrogateescape' if encoding != 'mbcs' else 'strict',
        **kwargs)

nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)

io.TextIOWrapper() plays the role of open() in this case. The code assumes that the `newline` parameter accepts '\0'. The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().

Note:

- `end=nl` is necessary because `print()` prints '\n' by default -- it does not use `file.newline`
- the `-0` option is required in the current implementation if filenames may have trailing whitespace. It can be improved
- SystemTextStream() handles filenames that are undecodable in the current locale, i.e., non-ascii names are allowed even in the C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear how to pass an undecodable filename via a pipe on Windows -- perhaps `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It assumes that the short path exists and is always encodable using mbcs.
If we can control all parts of the pipeline *and* Windows API uses proper utf-16 (not ucs-2) then utf-8 can be used to pass filenames via a pipe otherwise ReadConsoleW/WriteConsoleW could be tried e.g., https://github.com/Drekin/win-unicode-console -- Akira
On 22 July 2014 17:05, Akira Li <4kir4.1i@gmail.com> wrote:
The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().
Thanks. That's how you'd do it now. A question for the OP: how would the proposed change improve this code? Paul
Paul Moore <p.f.moore@gmail.com> writes:
On 22 July 2014 17:05, Akira Li <4kir4.1i@gmail.com> wrote:
The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().
Thanks. That's how you'd do it now.
You've cut too much e.g. I wrote in [1]:
io.TextIOWrapper() plays the role of open() in this case. The code assumes that `newline` parameter accepts '\0'.
[1] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
A question for the OP: how would the proposed change improve this code? Paul
I'm not sure who is OP in this context but I can answer: the proposed change might allow TextIOWrapper(.., newline='\0') and the code in [1] doesn't support `-0` command-line parameter without it. -- Akira
On 23 July 2014 00:48, Akira Li <4kir4.1i@gmail.com> wrote:
I'm not sure who is OP in this context but I can answer: the proposed change might allow TextIOWrapper(.., newline='\0') and the code in [1] doesn't support `-0` command-line parameter without it.
I see. My apologies, I read that part but didn't spot what you meant. Thanks for clarifying.
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i@gmail.com> wrote:
Paul Moore <p.f.moore@gmail.com> writes:
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.
Here's the sort of thing I mean, written for newline-separated files:
import sys
def process(filename):
    """Trivial example"""
    return filename.lower()
if __name__ == '__main__':
for filename in sys.stdin:
    filename = process(filename)
    print(filename)
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
`find -print0 | ./tr-filename -0 | xargs -0` example implies that you can replace `sys.std*` streams without worrying about preserving `sys.__std*__` streams:
#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path

def transform_filename(filename: str) -> str:  # example
    """Normalize whitespace in basename."""
    path = Path(filename)
    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
    path.replace(new_path)  # rename on disk if necessary
    return str(new_path)

def SystemTextStream(bytes_stream, **kwargs):
    encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(bytes_stream,
        encoding=encoding,
        errors='surrogateescape' if encoding != 'mbcs' else 'strict',
        **kwargs)

nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)
Nice, much more complete example than mine. I just tried to handle as many edge cases as the original he asked about, but you handle everything.
io.TextIOWrapper() plays the role of open() in this case. The code assumes that `newline` parameter accepts '\0'.
The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().
Note:
- `end=nl` is necessary because `print()` prints '\n' by default -- it does not use `file.newline`
Actually, yes it does. Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or ''). But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...) But it uses sys.stdout.newline, not sys.stdin.newline.
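[The write-side translation described here can be seen directly with nothing but io -- a quick illustration, using '\r\n' since '\0' is not yet a legal value:

import io

# TextIOWrapper translates '\n' on write when newline is one of the
# "other legal values" (here '\r\n'); a bare '\r' passes through untouched.
buf = io.BytesIO()
w = io.TextIOWrapper(buf, encoding='ascii', newline='\r\n')
w.write('a\nb\r')
w.flush()
data = buf.getvalue()
print(data)  # b'a\r\nb\r'

This is the same remapping that would be worrying if it applied to a '\0' separator, since filenames can legitimately contain '\n'.]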
- the `-0` option is required in the current implementation if filenames may have trailing whitespace. It can be improved
- SystemTextStream() handles filenames that are undecodable in the current locale, i.e., non-ascii names are allowed even in the C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear how to pass an undecodable filename via a pipe on Windows -- perhaps `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It assumes that the short path exists and is always encodable using mbcs.

If we can control all parts of the pipeline *and* the Windows API uses proper utf-16 (not ucs-2) then utf-8 can be used to pass filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could be tried, e.g., https://github.com/Drekin/win-unicode-console
First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)? Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them? On Unix, of course, it's a real problem.
Andrew Barnert <abarnert@yahoo.com> writes:
On Jul 22, 2014, at 9:05, Akira Li <4kir4.1i@gmail.com> wrote:
Paul Moore <p.f.moore@gmail.com> writes:
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.
Here's the sort of thing I mean, written for newline-separated files:
import sys
def process(filename):
    """Trivial example"""
    return filename.lower()
if __name__ == '__main__':
for filename in sys.stdin:
    filename = process(filename)
    print(filename)
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
`find -print0 | ./tr-filename -0 | xargs -0` example implies that you can replace `sys.std*` streams without worrying about preserving `sys.__std*__` streams:
#!/usr/bin/env python
import io
import re
import sys
from pathlib import Path

def transform_filename(filename: str) -> str:  # example
    """Normalize whitespace in basename."""
    path = Path(filename)
    new_path = path.with_name(re.sub(r'\s+', ' ', path.name))
    path.replace(new_path)  # rename on disk if necessary
    return str(new_path)

def SystemTextStream(bytes_stream, **kwargs):
    encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(bytes_stream,
        encoding=encoding,
        errors='surrogateescape' if encoding != 'mbcs' else 'strict',
        **kwargs)

nl = '\0' if '-0' in sys.argv else None
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl)
for line in SystemTextStream(sys.stdin.detach(), newline=nl):
    print(transform_filename(line.rstrip(nl)), end=nl)
Nice, much more complete example than mine. I just tried to handle as many edge cases as the original he asked about, but you handle everything.
io.TextIOWrapper() plays the role of open() in this case. The code assumes that `newline` parameter accepts '\0'.
The example function handles Unicode whitespace to demonstrate why opaque bytes-based cookies can't be used to represent filenames in this case even on POSIX, though which characters are recognized depends on sys.getfilesystemencoding().
Note:
- `end=nl` is necessary because `print()` prints '\n' by default -- it does not use `file.newline`
Actually, yes it does. Or, rather, print pastes on a '\n', but sys.stdout.write translates any '\n' characters to sys.stdout.writenl (a private variable that's initialized from the newline argument at construction time if it's anything other than None or '').
You are right. I stopped reading the source for the print() function at the `PyFile_WriteString("\n", file);` line, assuming that "\n" is not translated if newline="\0". But the current behaviour, if "\0" were in "the other legal values" category (like "\r"), would be to translate "\n" [1]:

    When writing output to the stream, if newline is None, any '\n'
    characters written are translated to the system default line
    separator, os.linesep. If newline is '' or '\n', no translation
    takes place. If newline is any of the other legal values, any '\n'
    characters written are translated to the given string.

[1] https://docs.python.org/3/library/io.html#io.TextIOWrapper

Example:

    $ ./python -c 'import sys, io; sys.stdout=io.TextIOWrapper(sys.stdout.detach(), newline="\r\n"); sys.stdout.write("\n\r\r\n")' | xxd
    0000000: 0d0a 0d0d 0d0a                           ......

"\n" is translated to b"\r\n" here and "\r" is left untouched (b"\r"). For the newline="\0" case to work, it should behave like the newline='' or newline='\n' cases instead, i.e., no translation should take place, to avoid corrupting embedded "\n\r" characters. My original code works as is in this case, i.e., *end=nl is still necessary*.
But of course that's the newline argument to sys.stdout, and you only changed sys.stdin, so you do need end=nl anyway. (And you wouldn't want output translation here anyway, because that could also translate '\n' characters in the middle of a line, re-creating the same problem we're trying to avoid...)
But it uses sys.stdout.newline, not sys.stdin.newline.
The code affects *both* sys.stdout/sys.stdin. Look [2]:
sys.stdout = SystemTextStream(sys.stdout.detach(), newline=nl) for line in SystemTextStream(sys.stdin.detach(), newline=nl): print(transform_filename(line.rstrip(nl)), end=nl)
[2] https://mail.python.org/pipermail/python-ideas/2014-July/028372.html
- SystemTextStream() handles filenames that are undecodable in the current locale, i.e., non-ascii names are allowed even in the C locale (LC_CTYPE=C)
- undecodable filenames are not supported on Windows. It is not clear how to pass an undecodable filename via a pipe on Windows -- perhaps `GetShortPathNameW -> fsencode -> pipe` might work in some cases. It assumes that the short path exists and is always encodable using mbcs.

If we can control all parts of the pipeline *and* the Windows API uses proper utf-16 (not ucs-2) then utf-8 can be used to pass filenames via a pipe; otherwise ReadConsoleW/WriteConsoleW could be tried, e.g., https://github.com/Drekin/win-unicode-console
First, don't both the Win32 APIs and the POSIX-ish layer in msvcrt on top of it guarantee that you can never get such unencodable filenames (sometimes by just pretending the file doesn't exist, but if possible by having the filesystem map it to something valid, unique, and persistent for this session, usually the short name)? Second, trying to solve this implies that you have some other native (as opposed to Cygwin) tool that passes or accepts such filenames over simple pipes (as opposed to PowerShell typed ones). Are there any? What does, say, mingw's find do with invalid filenames if it finds them?
In short: I don't know :) To be clear, I'm talking about native Windows applications (not find/xargs on Cygwin). The goal is to robustly process *arbitrary* filenames on Windows via a pipe (SystemTextStream()) or network (bytes interface). I know that the (A)nsi API (and therefore the "POSIX-ish layer" that uses narrow strings, such as main(), fopen(), fstream) is broken, e.g., for Thai filenames on a Greek computer [3]. The Unicode (W) API should enforce utf-16 in principle since Windows 2000 [4]. But I expect ucs-2 shows its ugly head in many places due to bad programming practices (based on the common wrong assumption that Unicode == UTF-16 == UCS-2) and/or bugs that were not fixed due to MS' backwards compatibility policies in the past [5].

[3] http://blog.gatunka.com/2014/04/25/character-encodings-for-modern-programmer...
[4] http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_envir...
[5] http://blogs.msdn.com/b/oldnewthing/archive/2003/10/15/55296.aspx

-- Akira
On Jul 21, 2014, at 0:04, Paul Moore <p.f.moore@gmail.com> wrote:
On 21 July 2014 01:41, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
OK, I wrote up a draft PEP, and attached it to the bug (if that's not a good thing to do, apologies); you can find it at http://bugs.python.org/file36008/pep-newline.txt
As a suggestion, how about adding an example of a simple nul-separated filename filter - the sort of thing that could go in a find -print0 | xxx | xargs -0 pipeline? If I understand it, that's one of the key motivating examples for this change, so seeing how it's done would be a great help.
Here's the sort of thing I mean, written for newline-separated files:
import sys
def process(filename):
    """Trivial example"""
    return filename.lower()
if __name__ == '__main__':
for filename in sys.stdin:
    filename = process(filename)
    print(filename)
for filename in io.TextIOWrapper(sys.stdin.buffer,
                                 encoding=sys.stdin.encoding,
                                 errors=sys.stdin.errors,
                                 newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

I assume you wanted an rstrip('\n') in the original, so I did the equivalent here. If you want to pipe the result to another -0 tool, you also need to add end='\0' to the print, of course.

If we had Nick Coghlan's separate idea of adding rewrap methods to the stream classes (not part of this proposal, but I would be happy to have it), it would be even simpler:

for filename in sys.stdin.rewrap(newline='\0'):
    filename = process(filename.rstrip('\0'))
    print(filename)

Anyway, this isn't perfect if, e.g., you might have illegal-as-UTF8 Latin-1 filenames hiding in your UTF8 filesystem, but neither is your code; in fact, this does exactly the same thing, except that it takes \0 terminators (so it can handle filenames with embedded newlines, or pipelines that use -print0 just because they can't be sure which tools in the chain can handle spaces). It's obviously a little more complicated than your code, but that's to be expected; it's a lot simpler than anything we can write today. (And it runs at the same speed as your code instead of 2x slower or worse.)
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
The open function is just a shortcut to constructing a stack of io classes; you can always construct them manually. It would be nice if some cases of that were made a little easier (again, see Nick's proposal above), but it's easy enough to live with.
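[To make that concrete, here is the stack that open(path, 'r') builds for you, assembled by hand -- a sketch; the real open() also handles many more options and error cases:

import io
import os
import tempfile

# FileIO (raw bytes) -> BufferedReader (buffering) -> TextIOWrapper (text):
# the same three layers open(path, 'r') constructs internally.
fd, path = tempfile.mkstemp()
os.write(fd, b'hello\nworld\n')
os.close(fd)

raw = io.FileIO(path, 'r')
buffered = io.BufferedReader(raw)
text = io.TextIOWrapper(buffered, encoding='utf-8')
content = text.read()
text.close()  # closing the top wrapper closes the whole stack
os.remove(path)
print(content)

Rewrapping an existing stream, as in the filter above, is just doing the TextIOWrapper step of this by hand on someone else's buffer.]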
On 23 July 2014 05:24, Andrew Barnert <abarnert@yahoo.com> wrote:
This is also an example of why I'm struggling to understand how an open() parameter "solves all the cases". There's no explicit open() call here, so how do you specify the record separator? Seeing how you propose this would work would be really helpful to me.
The open function is just a shortcut to constructing a stack of io classes;
Ah, yes, I get what you're saying now. I was reading your proposal too literally as being about "open", and forgetting you can use the underlying classes to rewrap existing streams. Thanks for your patience. Paul
Nick Coghlan wrote:
having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer.
In Classic MacOS, the way you gave a folder an icon was to put it in a hidden file called "Icon\r". -- Greg
The pattern I use, by far, most often with the -0 option is: find $path -print0 | xargs -0 some_command Embedding a '\n' in a filename might be weird, but having whitespace in general (i.e. spaces) really isn't uncommon. However, in this case it doesn't really seem to matter if some_command is some_command.py. But I still think the null byte special delimiter is plausible for similar pipelines. On Sat, Jul 19, 2014 at 6:40 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 July 2014 11:31, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python.
That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep.
Yes, but having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer. Hence why I think the PEP needs to explain why the UNIX utilities considered this use case sufficiently non-obscure to add explicit support for it, rather than just assuming that the obviousness of the use case can be taken for granted.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On 20 Jul 2014, at 03:40, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 July 2014 11:31, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jul 20, 2014 at 11:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At present, I'm genuinely unclear on why someone would ever want to pass the "-0" option to the other UNIX utilities, which then makes it very difficult to have a sensible discussion on how we should address that use case in Python.
That one's easy. What happens if you use 'find' to list files, and those files might have \n in their names? You need another sep.
Yes, but having a newline in a filename is sufficiently weird that I find it hard to imagine a scenario where "fix the filenames" isn't a better answer.
Because you are likely to have no control at all over what people do with filenames. Since, on POSIX at least, filenames are allowed to contain all characters other than NUL and /, you must be able to deal with that. This is similar to how you must also be able to deal with a mixture of filenames using different encodings, or even pure binary names. Wichert.
Chris Angelico writes:
But they might well be the same thing. Look at all the Unix commands that usually separate output with \n, but can be told to separate with \0 instead. If you're reading from something like that, it should be just as easy to split on \n as on \0.
Nick's point is more general, I think, but as a special case consider a "multiline" record. What's the right behavior on output from the application if the newline convention of this particular multiline differs from that of the rest of the output stream? IMO this goes beyond "consenting adults" (YMMV, of course). Steve
On 19.07.2014 09:10, Nick Coghlan wrote:
I still favour my proposal there to add a separate "readrecords()" method, rather than reusing the line based iteration methods - lines and arbitrary records *aren't* the same thing, and I don't think we'd be doing anybody any favours by conflating them (whether we're confusing them at the method level or at the constructor argument level).
Thinking about possible use-cases for my own work made me realize one thing: at least for text files, the distinction between records and lines, in practical terms, is that records may have *internal structure based on newline characters*, while lines are just lines. If a future readrecords() method would return the record as a StringIO or BytesIO object, this would allow nested reading of files as lines (with full newline processing) within records:

for record in infile.readrecords():
    for line in record:
        do_something()

For me, that sort of feature is a more common requirement than being able to retrieve single lines terminated by something other than newline characters. Maybe, though, it's possible to have both: a readrecords method like the one above and an extended set of "newline" tokens that can be passed to open (at least allowing "\0" seems to make sense). Best, Wolfgang
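A minimal sketch of such a readrecords() (the name is hypothetical, from this thread; this simplified version reads the whole file at once rather than streaming):

```python
import io

def readrecords(f, sep):
    # Hypothetical readrecords(): yield each sep-delimited record of a
    # text file as its own StringIO, so each record can be iterated
    # line by line with normal newline handling.
    for record in f.read().split(sep):
        if record:
            yield io.StringIO(record)

# Two records, each containing an internal newline:
infile = io.StringIO('a\nb\0c\nd\0')
records = [list(r) for r in readrecords(infile, '\0')]
# records == [['a\n', 'b'], ['c\n', 'd']]
```

A real method on the file object would stream instead of slurping, but the nested-iteration usage is the same.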
On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <guido@python.org> wrote:
I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?)
Good question about the where. The resplit function seems like it could be of more general use than just this case, but I'm not sure where it belongs. Maybe itertools? The iter(lambda: f.read(bufsize), b'') part seems too trivial to put anywhere, even just as an example in the docs—but given that it probably looks like a magic incantation to anyone who's a Python novice (even if they're a C or JS or whatever expert), maybe it is worth putting somewhere. Maybe io.iterchunks(f, 4096)? If so, the combination of the two into something like iterlines(f, b'\0') seems like it should go right alongside iterchunks. However…
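For reference, here are the pieces being discussed together: resplit as in the original post, with iterchunks and iterlines under the hypothetical names floated above:

```python
import io

def iterchunks(f, size=4096):
    # Equivalent to the iter(lambda: f.read(size), b'') incantation:
    # read fixed-size chunks until read() returns b'' at EOF.
    return iter(lambda: f.read(size), b'')

def resplit(chunks, sep):
    # Re-split an iterable of byte chunks on an arbitrary separator
    # (the resplit from the original post).
    buf = b''
    for chunk in chunks:
        parts = (buf + chunk).split(sep)
        yield from parts[:-1]
        buf = parts[-1]
    if buf:
        yield buf

def iterlines(f, sep=b'\n', size=4096):
    # The proposed combination of the two.
    return resplit(iterchunks(f, size), sep)

lines = list(iterlines(io.BytesIO(b'a\0bb\0ccc'), sep=b'\0'))
# lines == [b'a', b'bb', b'ccc']
```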
I don't think it is reasonable to add a new parameter to readline()
The problem is that my code has significant problems for many use cases, and I don't think they can be solved. Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc. Maybe if we had more powerful adapters or wrappers so I could just say "here's a pre-existing buffer plus a text-file-like object, now wrap that up as a real TextIOBase for me" it would be possible to write something that worked from outside without these problems, but as things stand, I don't see an answer. Maybe put resplit in the stdlib, then just give iterlines as a 2-liner example (in the itertools recipes, or the file-I/O section of the tutorial?) where all these problems can be raised and not answered?
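To make the UTF-16 ambiguity concrete: in UTF-16 every ASCII character carries a NUL byte, so byte-level splitting on b'\0' shreds text that contains no separator at all:

```python
data = 'ab'.encode('utf-16-le')
# data == b'a\x00b\x00' -- NUL bytes that are part of the characters,
# not separators
parts = data.split(b'\x00')
# bytewise splitting yields [b'a', b'b', b''], destroying the encoding
```

Only something sitting behind the stream's decoder, like readline does, can split such a file correctly.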
On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert < abarnert@yahoo.com.dmarc.invalid> wrote:
On Thursday, July 17, 2014 1:48 PM, Guido van Rossum <guido@python.org> wrote:
I think it's fine to add something to stdlib that encapsulates your example. (TBD: where?)
Good question about the where.
The resplit function seems like it could be of more general use than just this case, but I'm not sure where it belongs. Maybe itertools?
The iter(lambda: f.read(bufsize), b'') part seems too trivial to put anywhere, even just as an example in the docs—but given that it probably looks like a magic incantation to anyone who's a Python novice (even if they're a C or JS or whatever expert), maybe it is worth putting somewhere. Maybe io.iterchunks(f, 4096)?
If so, the combination of the two into something like iterlines(f, b'\0') seems like it should go right alongside iterchunks.
However…
I don't think it is reasonable to add a new parameter to readline()
The problem is that my code has significant problems for many use cases, and I don't think they can be solved.
Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.
You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object. This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation).

Maybe if we had more powerful adapters or wrappers so I could just say "here's a pre-existing buffer plus a text-file-like object, now wrap that up as a real TextIOBase for me" it would be possible to write something that worked from outside without these problems, but as things stand, I don't see an answer.
You probably have to do a separate wrapper for text streams; the types and buffering implementation are just too different.
Maybe put resplit in the stdlib, then just give iterlines as a 2-liner example (in the itertools recipes, or the file-I/O section of the tutorial?) where all these problems can be raised and not answered?
(Sorry, in a hurry / terribly distracted.) -- Guido van Rossum (python.org/~guido)
On Jul 17, 2014, at 15:37, Guido van Rossum <guido@python.org> wrote:
On Thu, Jul 17, 2014 at 2:59 PM, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
I don't think it is reasonable to add a new parameter to readline()
The problem is that my code has significant problems for many use cases, and I don't think they can be solved.
Calling readline (or iterating the file) uses the underlying buffer (and stream decoder, for text files), keeps the file pointer in the same place, etc. My code doesn't, and no external code can. So, besides being less efficient, it leaves the file pointer in the wrong place (imagine using it to parse an RFC822 header then read() the body), doesn't properly decode files where the separator can be ambiguous with other bytes (try separating on '\0' in a UTF-16 file), etc.
You can implement a subclass of io.BufferedIOBase that wraps an instance of io.RawIOBase (I think those are the right classes) where the wrapper adds a readuntil(separator) method. Whichever thing then wants to read the rest of the data should call read() on the wrapper object.
This still sounds a lot better to me than asking everyone to add a new parameter to their readline() (and the implementation).
[snip]
You probably have to do a separate wrapper for text streams, the types and buffering implementation are just too different.
The problem isn't needing two separate wrappers, it's that the text wrapper is effectively impossible.

For binary files, MyBufferedReader.readuntil is a slightly modified version of _pyio.RawIOBase.readline, which only needs to access the public interface of io.BufferedReader (peek and read). For text files, however, it needs to access private information from TextIOWrapper that isn't exposed from C to Python. And, unlike BufferedReader, TextIOWrapper has no way to peek ahead, or push data back onto the buffer, or anything else usable as a workaround, so even if you wanted to try to take care of the decoding state problems manually, you can't, except by reading one character at a time.

There are also some minor problems even for binary files (e.g., MyBufferedReader(f.raw) has a different file position from f, so if you switch between them you'll end up skipping part of the file), but these won't affect most use cases; the text file problem is the big one.
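For the binary half, here is a sketch of that readuntil (the class and method names are placeholders from this thread), built on BufferedReader's public peek()/read() and assuming a single-byte separator, since a longer one could straddle a peek boundary:

```python
import io

class SeparatorReader(io.BufferedReader):
    # Sketch of the suggested wrapper: the binary case works using only
    # the public interface (peek and read); the text case does not.
    def readuntil(self, sep=b'\n'):
        chunks = []
        while True:
            buffered = self.peek(1)      # inspect buffered bytes without consuming
            if not buffered:             # EOF: return whatever we collected
                break
            i = buffered.find(sep)
            if i >= 0:                   # separator found: consume through it
                chunks.append(self.read(i + len(sep)))
                break
            chunks.append(self.read(len(buffered)))  # consume and keep scanning
        return b''.join(chunks)

# BytesIO stands in for a raw stream here:
r = SeparatorReader(io.BytesIO(b'a\0bb\0ccc'))
records = [r.readuntil(b'\0') for _ in range(3)]
# records == [b'a\x00', b'bb\x00', b'ccc']
```

Note the file-position caveat above still applies: wrapping f.raw gives the wrapper its own position, independent of any buffering already done on f.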
participants (17)
- Akira Li
- Alexander Heger
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- Clint Hepner
- David Mertz
- Greg Ewing
- Guido van Rossum
- Juancarlo Añez
- MRAB
- Nick Coghlan
- Paul Moore
- Stephen J. Turnbull
- Steven D'Aprano
- Wichert Akkerman
- Wolfgang Maier