TextIOWrapper support for null-terminated lines

I have a use case where I'm writing a small filter in Python which should be able to take output from the find command, filter it, and optionally pass it to other commands (like xargs), but which can also be passed specific filenames for the input and/or output. find can output its filenames as null-terminated lines, since it is possible to have newlines in a filename (yuck). There are workarounds I can use to get the same thing done, essentially using TextIOWrapper with newline='' and scanning for any null terminators in the lines and splitting manually, but it would be nice if TextIOWrapper supported null characters as line terminators for such use cases, if possible. Something like this:

    # configure the input stream
    if arg.input:
        input_stream = open(arg.input, "rt",
                            newline="\000" if arg.null_lines else None)
    else:
        input_stream = sys.stdin
        input_stream.reconfigure(newline="\000" if arg.null_lines else None)

Because my use case also has the ability to specify an encoding for the input/output, my workaround was originally to use codecs.getreader/getwriter, which turned out to be rather slow (the output of the pipe to the filter came in bursts). Since I'm currently using 3.5 and can't use reconfigure, the workaround currently involves detaching sys.stdin and reattaching it to a new TextIOWrapper, setting buffering to 0 and newline to '' in null-terminated mode to get all the characters untranslated, and manually scanning/splitting on null characters, though I'm sure there are better ways to do this.

Thanks,

Brian Vanderburg II
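[For reference, a rough sketch of the 3.5-era workaround described above. The option names (arg.encoding, arg.null_lines) are illustrative stand-ins; this is an approximation, not the original poster's code.]

    import io
    import sys

    # Rewrap sys.stdin's underlying binary buffer in a fresh
    # TextIOWrapper with newline='' so no newline translation occurs;
    # the caller then splits records on '\0' by hand.
    if arg.input:
        input_stream = open(arg.input, "rt", encoding=arg.encoding,
                            newline="" if arg.null_lines else None)
    else:
        input_stream = io.TextIOWrapper(
            sys.stdin.detach(),
            encoding=arg.encoding,
            newline="" if arg.null_lines else None)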

On 2020-10-24 at 12:29:01 -0400, Brian Allen Vanderburg II via Python-ideas <python-ideas@python.org> wrote:
... Find can output it's filenames in null-terminated lines since it is possible to have newlines in a filename(yuck) ...
Spaces in filenames are just as bad, and much more common:

    $ touch 'foo bar'
    $ find . -name 'foo bar'
    ./foo bar
    $ find . -name 'foo bar' -print | xargs ls -l
    ls: cannot access './foo': No such file or directory
    ls: cannot access 'bar': No such file or directory
    $ find . -name 'foo bar' -print0 | xargs -0 ls -l
    -rw-r--r-- 1 dan dan 0 Oct 24 13:31 './foo bar'
    $ rm 'foo bar'

On 24Oct2020 13:37, Dan Sommers <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
But much easier to handle in simple text listings, which are newline delimited. You're really running into a horrible behaviour from xargs, which is one reason why GNU parallel exists.

Cheers,
Cameron Simpson <cs@cskk.id.au>

On Mon, Oct 26, 2020 at 8:44 AM Cameron Simpson <cs@cskk.id.au> wrote:
I don't consider the behaviour horrible, and xargs isn't the only thing to do this - other tools can be put into zero-termination mode too. But it's pretty rare to consume huge amounts of data in this way (normally it'll just be a list of file names), so what I would do is simply read the entire thing, then split on "\0". It's not like reading a gigabyte of log file, where you really want to work line by line and not read in more than you need; it's easily going to fit into memory.

If you actually DO need to read null-terminated records from a file that's too big for memory, it's probably worth just rolling your own buffering, reading a chunk at a time and splitting off the interesting parts. It's not hugely difficult, and it's a good exercise to do now and then.

And yes, I can see the temptation to get Python to do it, but unfortunately, newline support is such a weird mess of cross-platform support that I don't think it needs to be made more complicated :)

ChrisA
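[A minimal sketch of both approaches described above, in binary mode since find -print0 emits bytes; the helper name iter_records is made up for illustration.]

    import sys

    # 1. Input comfortably fits in memory: read it all and split.
    names = sys.stdin.buffer.read().split(b"\0")
    if names and names[-1] == b"":
        names.pop()  # a trailing NUL leaves one empty final field

    # 2. Input too big for memory: hand-rolled chunked buffering.
    def iter_records(fp, delim=b"\0", bufsize=65536):
        tail = b""
        while True:
            chunk = fp.read(bufsize)
            if not chunk:
                break
            parts = (tail + chunk).split(delim)
            tail = parts.pop()  # last piece may be incomplete
            yield from parts
        if tail:
            yield tail  # final record had no trailing delimiter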

On Sun, Oct 25, 2020, at 18:45, Chris Angelico wrote:
Maybe a getdelim method that ignores all the newline support complexity and just reads until it reaches the specified character? It would make sense on binary files too. The problem with rolling your own buffering is that there's not really a good way to put back the unused data after the delimiter if you're mixing this processing with something else. You'd have to do it a character at a time, which would be very inefficient in pure Python.
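[Something close to this can already be hand-written against a BufferedReader using peek(), which sidesteps the put-back problem because nothing past the delimiter is ever consumed. A hypothetical sketch of the proposed method, not existing stdlib API:]

    def getdelim(fp, delim=b"\0"):
        # fp is a buffered binary file (e.g. sys.stdin.buffer).
        out = bytearray()
        while True:
            buffered = fp.peek(1)  # inspect buffered bytes without consuming
            if not buffered:
                return bytes(out)  # EOF; may be a final unterminated record
            i = buffered.find(delim)
            if i >= 0:
                out += fp.read(i + 1)  # consume up to and including delim
                return bytes(out)
            out += fp.read(len(buffered))  # consume what we inspected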

On 26Oct2020 09:45, Chris Angelico <rosuav@gmail.com> wrote:
I'm not talking about -print0 and -0, which I merely dislike as a hack to accommodate badly named filenames, but xargs' non-0 behaviour, which splits on whitespace instead of newlines. That pissed me off enough to write my own. [...]
Aye. That's what my cs.buffer.CornuCopyBuffer class does for me: https://pypi.org/project/cs.buffer/

It's aimed particularly at parsing binary data easily (it takes any iterable of bytes, and has a few factories to start from a file etc). Parsing a NUL-terminated string from binary data isn't too bad given such a thing.

Cheers,
Cameron Simpson <cs@cskk.id.au>

participants (5)
- 2QdxY4RzWzUUiLuE@potatochowder.com
- Brian Allen Vanderburg II
- Cameron Simpson
- Chris Angelico
- Random832