I have a use case where I'm writing a small filter in Python which should be able to take output from the find command, filter it, and optionally pass it to other commands (like xargs), but which can also be passed specific filenames for the input and/or output. Find can output its filenames in null-terminated lines since it is possible to have newlines in a filename (yuck). There are workarounds that I can use to get the same thing done, essentially using TextIOWrapper with newline='' and scanning for any null terminators in the lines/splitting manually, but it would be nice if TextIOWrapper supported null characters as line terminators for such use cases, if possible.
Something like this:
    # configure input stream
    if arg.input:
        input_stream = open(arg.input, "rt",
                            newline="\000" if arg.null_lines else None)
    else:
        input_stream = sys.stdin
        input_stream.reconfigure(newline="\000" if arg.null_lines else None)

    # configure output stream likewise
Because my use case also has the ability to specify an encoding for the input/output, my workaround was originally to use codecs.getreader/getwriter, which turned out to be rather slow (the output of the pipe through the filter came in bursts). Since I'm currently using 3.5 and can't use reconfigure, the workaround now involves detaching sys.stdin and reattaching it to a new TextIOWrapper, setting buffering to 0 and newline to '' in null-terminated mode so that all the characters come through untranslated, and manually scanning/splitting on null characters, though I'm sure there are better ways to do this.
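Roughly, the current workaround looks like this (a sketch only; arg.input, arg.encoding and arg.null_lines stand in for the real argument handling):

    import io
    import sys

    # Rewrap stdin's binary buffer so the encoding can be chosen and
    # newline translation switched off; works on 3.5 without reconfigure().
    if arg.input:
        input_stream = open(arg.input, "rt", encoding=arg.encoding, newline="")
    else:
        input_stream = io.TextIOWrapper(sys.stdin.detach(),
                                        encoding=arg.encoding,
                                        newline="")
    # ...then read in chunks and split records on "\0" by hand.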
Thanks,
Brian Vanderburg II
On 2020-10-24 at 12:29:01 -0400, Brian Allen Vanderburg II via Python-ideas python-ideas@python.org wrote:
... Find can output its filenames in null-terminated lines since it is possible to have newlines in a filename (yuck) ...
Spaces in filenames are just as bad, and much more common:
$ touch 'foo bar'
$ find . -name 'foo bar'
./foo bar
$ find . -name 'foo bar' -print | xargs ls -l
ls: cannot access './foo': No such file or directory
ls: cannot access 'bar': No such file or directory
$ find . -name 'foo bar' -print0 | xargs -0 ls -l
-rw-r--r-- 1 dan dan 0 Oct 24 13:31 './foo bar'
$ rm 'foo bar'
On 24Oct2020 13:37, Dan Sommers 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
On 2020-10-24 at 12:29:01 -0400, Brian Allen Vanderburg II via Python-ideas python-ideas@python.org wrote:
... Find can output its filenames in null-terminated lines since it is possible to have newlines in a filename (yuck) ...
Spaces in filenames are just as bad, and much more common:
But much easier to handle in simple text listings, which are newline delimited.
You're really running into a horrible behaviour from xargs, which is one reason why GNU parallel exists.
Cheers, Cameron Simpson cs@cskk.id.au
On Mon, Oct 26, 2020 at 8:44 AM Cameron Simpson cs@cskk.id.au wrote:
On 24Oct2020 13:37, Dan Sommers 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
On 2020-10-24 at 12:29:01 -0400, Brian Allen Vanderburg II via Python-ideas python-ideas@python.org wrote:
... Find can output its filenames in null-terminated lines since it is possible to have newlines in a filename (yuck) ...
Spaces in filenames are just as bad, and much more common:
But much easier to handle in simple text listings, which are newline delimited.
You're really running into a horrible behaviour from xargs, which is one reason why GNU parallel exists.
I don't consider the behaviour horrible, and xargs isn't the only thing to do this - other tools can be put into zero-termination mode too.
But it's pretty rare to consume huge amounts of data in this way (normally it'll just be a list of file names), so what I would do is simply read the entire thing, then split on "\0". It's not like reading a gigabyte of log file, where you really want to work line by line and not read in more than you need; it's easily going to fit into memory.
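For the common case that's only a couple of lines (a sketch, assuming the data arrives on stdin):

    import sys

    data = sys.stdin.buffer.read()
    # find -print0 terminates each name with NUL, so drop the empty last field
    names = [name for name in data.split(b"\0") if name]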
If you actually DO need to read null-terminated records from a file that's too big for memory, it's probably worth just rolling your own buffering, reading a chunk at a time and splitting off the interesting parts. It's not hugely difficult, and it's a good exercise to do now and then. And yes, I can see the temptation to get Python to do it, but unfortunately, newline support is such a weird mess of cross-platform support that I don't think it needs to be made more complicated :)
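Something along these lines (a rough sketch, not a polished API; iter_null_records is a made-up name):

    def iter_null_records(fp, chunk_size=65536):
        # Yield NUL-terminated records from a binary file object,
        # reading a chunk at a time instead of slurping the whole file.
        pending = b""
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            pending += chunk
            *records, pending = pending.split(b"\0")
            for record in records:
                yield record
        if pending:
            yield pending  # trailing data with no final NUL

    # e.g.: for name in iter_null_records(sys.stdin.buffer): ...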
ChrisA
On Sun, Oct 25, 2020, at 18:45, Chris Angelico wrote:
If you actually DO need to read null-terminated records from a file that's too big for memory, it's probably worth just rolling your own buffering, reading a chunk at a time and splitting off the interesting parts. It's not hugely difficult, and it's a good exercise to do now and then. And yes, I can see the temptation to get Python to do it, but unfortunately, newline support is such a weird mess of cross-platform support that I don't think it needs to be made more complicated :)
Maybe a getdelim method that ignores all the newline support complexity and just reads until it reaches the specified character? It would make sense on binary files too.
The problem with rolling your own buffering is that there's not really a good way to put back the unused data after the delimiter if you're mixing this processing with something else. You'd have to do it a character at a time, which would be very inefficient in pure python.
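If the binary layer is an io.BufferedReader (as sys.stdin.buffer normally is), peek() gets most of the way there without consuming past the delimiter; a sketch (read_until is a made-up name, not an existing API):

    def read_until(buffered, delim=b"\0"):
        # Read up to and including delim from an io.BufferedReader,
        # leaving everything after the delimiter unconsumed.
        out = bytearray()
        while True:
            chunk = buffered.peek(1)          # inspect buffered bytes, don't consume
            if not chunk:                     # EOF (for a blocking stream)
                return bytes(out)
            i = chunk.find(delim)
            if i >= 0:
                out += buffered.read(i + 1)   # consume through the delimiter
                return bytes(out)
            out += buffered.read(len(chunk))  # consume what we inspected, keep going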
On 26Oct2020 09:45, Chris Angelico rosuav@gmail.com wrote:
On Mon, Oct 26, 2020 at 8:44 AM Cameron Simpson cs@cskk.id.au wrote:
On 24Oct2020 13:37, Dan Sommers 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
Spaces in filenames are just as bad, and much more common:
But much easier to handle in simple text listings, which are newline delimited. You're really running into a horrible behaviour from xargs, which is one reason why GNU parallel exists.
I don't consider the behaviour horrible, and xargs isn't the only thing to do this - other tools can be put into zero-termination mode too.
I'm not talking about -print0 and -0, which I merely dislike as a hack to accommodate badly named filenames, but xargs' non-0 behaviour, which splits on whitespace instead of newlines. That pissed me off enough to write my own.
[...]
If you actually DO need to read null-terminated records from a file that's too big for memory, it's probably worth just rolling your own buffering, reading a chunk at a time and splitting off the interesting parts. It's not hugely difficult, and it's a good exercise to do now and then.
Aye. That's what my cs.buffer.CornuCopyBuffer class does for me:
https://pypi.org/project/cs.buffer/
aimed particularly at parsing binary data easily (it takes any iterable of bytes, and has a few factories to start from a file, etc.).
Parsing a NUL terminated string from binary data isn't too bad given such a thing.
Cheers, Cameron Simpson cs@cskk.id.au
On Mon, Oct 26, 2020 at 10:47 AM Cameron Simpson cs@cskk.id.au wrote:
On 26Oct2020 09:45, Chris Angelico rosuav@gmail.com wrote:
On Mon, Oct 26, 2020 at 8:44 AM Cameron Simpson cs@cskk.id.au wrote:
On 24Oct2020 13:37, Dan Sommers 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
Spaces in filenames are just as bad, and much more common:
But much easier to handle in simple text listings, which are newline delimited. You're really running into a horrible behaviour from xargs, which is one reason why GNU parallel exists.
I don't consider the behaviour horrible, and xargs isn't the only thing to do this - other tools can be put into zero-termination mode too.
I'm not talking about -print0 and -0, which I merely dislike as a hack to accommodate badly named filenames, but xargs' non-0 behaviour, which splits on whitespace instead of newlines. That pissed me off enough to write my own.
Ohh, I see what you mean. Yeah, newlines would be a better default for a lot of situations. Can't be changed now.
ChrisA