At the moment, subprocess offers two options for handling the standard IO streams of the child. By default, the streams are binary, or you can set universal_newlines to get text-mode streams with universal newline handling enabled. With universal_newlines, the encoding of the streams is the default value for the environment (whatever locale.getpreferredencoding() returns). However, there can be cases where you want finer control over the encoding to use (for example, if you run Python in a subprocess and set PYTHONIOENCODING).

I propose adding an "encoding" parameter to subprocess.Popen (and the various wrapper routines) to allow specifying the actual encoding to use. Obviously, you can simply wrap the binary streams yourself - the main use for this facility would be in the higher level functions like check_output and communicate.

Does this seem like a reasonable suggestion?

Paul
On 29 Aug 2014 17:52, "Paul Moore" <p.f.moore@gmail.com> wrote:
At the moment, subprocess offers two options for handling the standard IO streams of the child. By default, the streams are binary, or you can set universal_newlines to get text-mode streams with universal newline handling enabled.
With universal_newlines, the encoding of the streams is the default value for the environment (whatever locale.getpreferredencoding() returns). However, there can be cases where you want finer control over the encoding to use (for example, if you run Python in a subprocess and set PYTHONIOENCODING).
I propose adding an "encoding" parameter to subprocess.Popen (and the various wrapper routines) to allow specifying the actual encoding to use.
Obviously, you can simply wrap the binary streams yourself - the main use for this facility would be in the higher level functions like check_output and communicate.
Does this seem like a reasonable suggestion?
This actually gets a little messy once you start digging into it, as you actually have up to 3 streams to deal with (stdin, stdout, stderr), and may want to set the error handler in addition to the encoding. http://bugs.python.org/issue6135 has the many gory details.

It's a problem that definitely needs solving, but it may be better approached by making it easier to create the pipes *separately*, and then pass relevant details into the subprocess call.

As with win_unicode_console (and even contextlib2), it's probably worth experimenting in a PyPI module, as that will make it easier for people to try out with existing Python 3 releases.

Cheers,
Nick.
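A sketch of the manual wrapping available today (TextIOWrapper and its parameters are real, current API; the command name is an assumption):

  import io
  import subprocess

  # Ask Popen for a binary pipe, then wrap it yourself with whatever
  # encoding and error handler you need.
  p = subprocess.Popen(['some-cmd'], stdout=subprocess.PIPE)
  with io.TextIOWrapper(p.stdout, encoding='utf-8',
                        errors='replace') as out:
      for line in out:
          print(line, end='')
  p.wait()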
Paul
On Fri, 29 Aug 2014 18:39:35 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
This actually gets a little messy once you start digging into it, as you actually have up to 3 streams to deal with (stdin, stdout, stderr), and may want to set the error handler in addition to the encoding.
At this point, I'd suggest creating a clean, separate TextPopen class (or subclass) without any legacy argument baggage.

As for per-stream settings, we could allow passing each of *encoding* and *errors* in two forms:

- as a string, in which case it applies to all three pipes
- as a dict, in which case it is looked up for each of the "stdin", "stdout", "stderr" pipes (if configured as pipes)

We could then deprecate the bogus-named "universal_newlines" in the main Popen class.

Regards

Antoine.
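To make the two proposed forms concrete (a hypothetical sketch; neither TextPopen nor the dict form exists in subprocess):

  # Hypothetical API - string form: one encoding for all three pipes.
  p = TextPopen(cmd, stdin=PIPE, stdout=PIPE, encoding='utf-8')

  # Hypothetical API - dict form: per-pipe settings, looked up by name.
  p = TextPopen(cmd, stdin=PIPE, stdout=PIPE,
                encoding={'stdin': 'utf-8', 'stdout': 'cp1252'},
                errors={'stdout': 'surrogateescape'})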
On 30 August 2014 12:07, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 29 Aug 2014 18:39:35 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
This actually gets a little messy once you start digging into it, as you actually have up to 3 streams to deal with (stdin, stdout, stderr), and may want to set the error handler in addition to the encoding.
At this point, I'd suggest creating a clean, separate TextPopen class (or subclass) without any legacy argument baggage.
As for per-stream settings, we could allow passing each of *encoding* and *errors* in two forms:

- as a string, in which case it applies to all three pipes
- as a dict, in which case it is looked up for each of the "stdin", "stdout", "stderr" pipes (if configured as pipes)
We could then deprecate the bogus-named "universal_newlines" in the main Popen class.
Sounds reasonable. I'll look into that (no promises on timescales :-)) In practice, I doubt we'd need per-stream encodings particularly often, so I like the idea of *not* cluttering the API to cope with them.

I'm curious, by the way - what arguments do you consider as "legacy baggage"? (A lot of them seem to me to be OS-specific and/or specialised rather than legacy.)

In practice, we'd probably need to do something about the utility functions like check_output and communicate as well.

Paul
On Sat, 30 Aug 2014 13:05:02 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
Sounds reasonable. I'll look into that (no promises on timescales :-)) In practice, I doubt we'd need per-stream encodings particularly often, so I like the idea of *not* cluttering the API to cope with them. I'm curious, by the way - what arguments do you consider as "legacy baggage"? (A lot of them seem to me to be OS-specific and/or specialised rather than legacy.)
I was thinking mostly about universal_newlines. Perhaps preexec_fn applies too, since it's dangerous (read: unstable). Regards Antoine.
Hi, On Sat, Aug 30, 2014 at 2:10 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sat, 30 Aug 2014 13:05:02 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
Sounds reasonable. I'll look into that (no promises on timescales :-)) In practice, I doubt we'd need per-stream encodings particularly often, so I like the idea of *not* cluttering the API to cope with them. I'm curious, by the way - what arguments do you consider as "legacy baggage"? (A lot of them seem to me to be OS-specific and/or specialised rather than legacy.)
I was thinking mostly about universal_newlines. Perhaps preexec_fn applies too, since it's dangerous (read: unstable).
preexec_fn is important, though, if you want to run something with a different uid and gid from a sudo script, for example.

Cheers,
Moritz
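A sketch of that use case (POSIX-only; the uid/gid values and command are assumptions):

  import os
  import subprocess

  def demote(uid, gid):
      # Runs in the child between fork() and exec(); drop the group id
      # first, while we still have the privilege to do so.
      def set_ids():
          os.setgid(gid)
          os.setuid(uid)
      return set_ids

  subprocess.Popen(['whoami'], preexec_fn=demote(1000, 1000))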
On 30 Aug 2014 21:08, "Antoine Pitrou" <solipsis@pitrou.net> wrote:
On Fri, 29 Aug 2014 18:39:35 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
This actually gets a little messy once you start digging into it, as you actually have up to 3 streams to deal with (stdin, stdout, stderr), and may want to set the error handler in addition to the encoding.
At this point, I'd suggest creating a clean, separate TextPopen class (or subclass) without any legacy argument baggage.
As for per-stream settings, we could allow passing each of *encoding* and *errors* in two forms:

- as a string, in which case it applies to all three pipes
- as a dict, in which case it is looked up for each of the "stdin", "stdout", "stderr" pipes (if configured as pipes)
We could then deprecate the bogus-named "universal_newlines" in the main Popen class.
That sounds like a plausible plan to me - splitting the API like that also reflects what was needed to get IO working sensibly in the first place.

Cheers,
Nick.
Paul Moore <p.f.moore@gmail.com> writes:
I propose adding an "encoding" parameter to subprocess.Popen (and the various wrapper routines) to allow specifying the actual encoding to use.
Obviously, you can simply wrap the binary streams yourself - the main use for this facility would be in the higher level functions like check_output and communicate.
Does this seem like a reasonable suggestion?
Could you provide examples of how the final result would look?

For example, to read a utf-8 encoded byte stream as text with universal newline mode enabled:

  with (Popen(cmd, stdout=PIPE, bufsize=1) as p,
        TextIOWrapper(p.stdout, encoding='utf-8') as pipe):
      for line in pipe:
          process(line)

Or the same, all at once:

  lines = check_output(cmd).decode('utf-8').splitlines() #XXX issue22232

--
Akira
On 1 September 2014 20:14, Akira Li <4kir4.1i@gmail.com> wrote:
Could you provide examples of how the final result would look?
Do you mean what I'm proposing?

  p = Popen(..., encoding='utf-8')

p.stdout is now a text stream assuming the data is in UTF-8, rather than assuming it's in the default encoding.
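And for the higher level functions, correspondingly (still hypothetical, since the parameter doesn't exist yet):

  # Hypothetical: 'encoding' as proposed; the result would be str
  # rather than bytes, with no manual .decode() step.
  out = check_output(cmd, encoding='utf-8')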
For example, to read a utf-8 encoded byte stream as text with universal newline mode enabled:

  with (Popen(cmd, stdout=PIPE, bufsize=1) as p,
        TextIOWrapper(p.stdout, encoding='utf-8') as pipe):
      for line in pipe:
          process(line)
That looks sort of like what I had in mind as a workaround. I hadn't tried it to confirm it worked, though.
Or the same, all at once:
lines = check_output(cmd).decode('utf-8').splitlines() #XXX issue22232
Yes, essentially, although the need for an explicit decode feels a bit ugly to me... Paul
Paul Moore <p.f.moore@gmail.com> writes:
On 1 September 2014 20:14, Akira Li <4kir4.1i@gmail.com> wrote:
Could you provide examples of how the final result would look?
Do you mean what I'm proposing?
  p = Popen(..., encoding='utf-8')

p.stdout is now a text stream assuming the data is in UTF-8, rather than assuming it's in the default encoding.
What if you want to specify an error handler, e.g., to read a file list from a `find -print0`-like program? You could pass errors='surrogateescape', newline='\0' (issue1152248) to TextIOWrapper(p.stdin). Both errors and newline can be different for the stdin/stdout pipes.
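With today's API that might look like this (a sketch; note TextIOWrapper does not accept newline='\0' - that is exactly what issue1152248 asks for - so the NUL split is done by hand):

  import io
  from subprocess import Popen, PIPE

  p = Popen(['find', '.', '-print0'], stdout=PIPE)
  with io.TextIOWrapper(p.stdout, encoding='utf-8',
                        errors='surrogateescape') as out:
      # No newline='\0' support yet, so read and split manually.
      names = out.read().split('\0')
  p.wait()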
For example, to read a utf-8 encoded byte stream as text with universal newline mode enabled:

  with (Popen(cmd, stdout=PIPE, bufsize=1) as p,
        TextIOWrapper(p.stdout, encoding='utf-8') as pipe):
      for line in pipe:
          process(line)
That looks sort of like what I had in mind as a workaround. I hadn't tried it to confirm it worked, though.
Or the same, all at once:
lines = check_output(cmd).decode('utf-8').splitlines() #XXX issue22232
Yes, essentially, although the need for an explicit decode feels a bit ugly to me...
-- Akira
On Monday, September 1, 2014 1:10 PM, Akira Li <4kir4.1i@gmail.com> wrote:
Paul Moore <p.f.moore@gmail.com> writes:
On 1 September 2014 20:14, Akira Li <4kir4.1i@gmail.com> wrote:
Could you provide examples of how the final result would look?
Do you mean what I'm proposing?
  p = Popen(..., encoding='utf-8')

p.stdout is now a text stream assuming the data is in UTF-8, rather than assuming it's in the default encoding.
What if you want to specify an error handler, e.g., to read a file list from a `find -print0`-like program? You could pass errors='surrogateescape', newline='\0' (issue1152248) to TextIOWrapper(p.stdin).
Presumably you either meant passing them to `TextIOWrapper(p.stdout)` for `find -print0`, or passing them to `TextIOWrapper(p.stdin)` for `xargs -0`; find doesn't even look at its input.
Both errors and newlines can be different for stdin/stdout pipes.
This brings up a good point: having a single encoding, errors, and newlines set of parameters for Popen and the convenience functions implies that you want to pass the same ones to all pipes. But how often is that true?

In your particular case, for `find -print0`, you want `newline='\0'` on stdout, but not on stderr. For the convenience methods that's probably not an issue, because the only way to read both stdout and stderr is to reroute the latter to the former anyway. But even there, you might not necessarily want input and output to be the same; `xargs -0` is a perfect example of that.

And, even forgetting #1152248, it's not hard to think of cases where you want input and output to be different. For example, I've got an old script that selects and cats a bunch of old Excel-format CSV files (in CP-1252, CRLF) off a file server, based on input data in native text files (which on my machine means UTF-8, LF). Using it with binary pipes is pretty easy, and changing it to explicitly wrap each pipe in the appropriate `TextIOWrapper` would be easy, but being able to pass a single encoding and newline value to the Popen would be misleading...

But as long as there are enough use cases for wanting to pass the same arguments for all pipes, I think the suggestion is OK. Especially considering that often you only want one pipe in the first place, which counts as a use case for passing the same arguments for all 1 pipe, right?

(By the way, thanks for this reminder to finish testing and cleaning up that patch for #1152248...)
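The explicit wrapping for that mixed case would look roughly like this (a sketch; the command name is made up):

  import io
  from subprocess import Popen, PIPE

  p = Popen(['legacy-cat-tool'], stdin=PIPE, stdout=PIPE)
  # Commands go in as UTF-8 with LF line endings (no translation)...
  send = io.TextIOWrapper(p.stdin, encoding='utf-8', newline='\n')
  # ...while the CSV data comes back as CP-1252 with CRLF intact.
  recv = io.TextIOWrapper(p.stdout, encoding='cp1252', newline='')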
Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> writes:
On Monday, September 1, 2014 1:10 PM, Akira Li <4kir4.1i@gmail.com> wrote:
Paul Moore <p.f.moore@gmail.com> writes:
On 1 September 2014 20:14, Akira Li <4kir4.1i@gmail.com> wrote:
Could you provide examples of how the final result would look?
Do you mean what I'm proposing?
  p = Popen(..., encoding='utf-8')

p.stdout is now a text stream assuming the data is in UTF-8, rather than assuming it's in the default encoding.
What if you want to specify an error handler, e.g., to read a file list from a `find -print0`-like program? You could pass errors='surrogateescape', newline='\0' (issue1152248) to TextIOWrapper(p.stdin).
Presumably you either meant passing them to `TextIOWrapper(p.stdout)` for `find -print0`, or passing them to `TextIOWrapper(p.stdin)` for `xargs -0`; find doesn't even look at its input.
You are right. I was thinking of 'surrogateescape' in terms of reading, which I associate with sys.stdin, so I wrote p.stdin instead of p.stdout by mistake.

--
Akira
On 1 September 2014 21:37, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
This brings up a good point: having a single encoding, errors, and newlines set of parameters for Popen and the convenience functions implies that you want to pass the same ones to all pipes. But how often is that true?
My proposal was purely for encoding, and was prompted by the fact that the Windows default encoding does not support all of Unicode. Setting PYTHONIOENCODING to utf-8 for a Python subprocess allows handling of all of Unicode if you can set the subprocess channels' encoding to utf-8. As PYTHONIOENCODING affects all 3 channels, being able to set a single value for all 3 channels is sufficient for that use case.

Setting newline and the error handler were *not* part of my original proposal, essentially because I know of no other way to force a subprocess to use anything other than the default encoding for the standard IO streams. Handling programs that are defined as using the standard streams for anything other than normal text (nul-terminated lines, explicitly defined non-default encodings) isn't something I have any examples of.

The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.

I'll think some more on this...

Paul
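A sketch of that use case (the commented line shows the proposed parameter, which does not exist yet; the child script name is an assumption):

  import os
  from subprocess import check_output

  env = dict(os.environ, PYTHONIOENCODING='utf-8')
  # Today: binary output, decoded by hand.
  out = check_output(['python', 'child.py'], env=env).decode('utf-8')
  # Proposed: out = check_output(['python', 'child.py'], env=env,
  #                              encoding='utf-8')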
On 2 September 2014 07:15, Paul Moore <p.f.moore@gmail.com> wrote:
The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.
Don't forget Antoine's suggestion of creating a TextPopen subclass that wraps the streams as strict UTF-8 by default and allows the encoding and errors arguments to be either strings (affecting all pipes) or a dictionary mapping "stdin", "stdout" and "stderr" to individual settings. With that, the simple utf-8 example just becomes:

  with TextPopen(cmd, stdout=PIPE) as p:
      for line in p.stdout:
          process(line)
I'll think some more on this...
For your torture test, consider the "iconv" (or "win_iconv") utility, which does encoding conversions, and how you might test that from a Python program without needing to do your own encoding and decoding, but instead let the subprocess module handle it for you :)

(There's a flip side to that problem, which is the question of *writing* an iconv utility in Python 3, and that's why there's an open RFE to support changing the encoding of an existing stream.)

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
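A sketch of that torture test (hypothetical, using the dict form of Antoine's TextPopen suggestion): feed iconv str data encoded one way and read str data back decoded another, with subprocess doing all of the encoding work.

  # Hypothetical TextPopen with per-pipe encodings; iconv converts the
  # bytes from utf-8 to cp1252, and subprocess handles both str ends.
  with TextPopen(['iconv', '-f', 'utf-8', '-t', 'cp1252'],
                 stdin=PIPE, stdout=PIPE,
                 encoding={'stdin': 'utf-8', 'stdout': 'cp1252'}) as p:
      out, _ = p.communicate('héllo\n')   # str in, str out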
On 2 September 2014 14:25, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 2 September 2014 07:15, Paul Moore <p.f.moore@gmail.com> wrote:
The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.
Don't forget Antoine's suggestion of creating a TextPopen subclass that wraps the streams as strict UTF-8 by default and allows the encoding and errors arguments to be either strings (affecting all pipes) or a dictionary mapping "stdin", "stdout" and "stderr" to individual settings.
With that, the simple utf-8 example just becomes:
  with TextPopen(cmd, stdout=PIPE) as p:
      for line in p.stdout:
          process(line)
I'd not forgotten that, but it doesn't help for the -print0 case, which is about using nul as a line ending, and not about encodings. I'm going to carefully avoid getting sucked into that open issue here, and stick to only considering encodings :-)
I'll think some more on this...
For your torture test, consider the "iconv" (or "win_iconv") utility, which does encoding conversions, and how you might test that from a Python program without needing to do your own encoding and decoding, but instead let the subprocess module handle it for you :)
That's another good use case for this functionality. Paul
On 9/2/2014 9:25 AM, Nick Coghlan wrote:
On 2 September 2014 07:15, Paul Moore <p.f.moore@gmail.com> wrote:
The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.
Don't forget Antoine's suggestion of creating a TextPopen subclass
I would expect something called Textxxx to present (text) strings, not bytes.
that wraps the streams as strict UTF-8 by default and allows the
But this implies to me that I would still get (encoded) bytes.
encoding and errors arguments to be either strings (affecting all pipes) or a dictionary mapping "stdin", "stdout" and "stderr" to individual settings.
What I would want is automatic conversion of strings to encoded bytes on send to the pipe, and automatic reconversion of encoded bytes to strings on receive. For that, there is little reason I can think of to use anything other than utf-8.
With that, the simple utf-8 example just becomes:
  with TextPopen(cmd, stdout=PIPE) as p:
      for line in p.stdout:
          process(line)
Would type(line) be str or bytes? -- Terry Jan Reedy
On 3 Sep 2014 07:30, "Terry Reedy" <tjreedy@udel.edu> wrote:
On 9/2/2014 9:25 AM, Nick Coghlan wrote:
On 2 September 2014 07:15, Paul Moore <p.f.moore@gmail.com> wrote:
The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.
Don't forget Antoine's suggestion of creating a TextPopen subclass
I would expect something called Textxxx to present (text) strings, not bytes.

Exactly.
that wraps the streams as strict UTF-8 by default and allows the
But this implies to me that I would still get (encoded) bytes.
I'm not sure how that follows - TextPopen is making the assumption *because* it is providing a str based API, and thus needs to know the appropriate text encoding details.
encoding and errors arguments to be either strings (affecting all pipes) or a dictionary mapping "stdin", "stdout" and "stderr" to individual settings.
What I would want is automatic conversion of strings to encoded bytes on send to the pipe, and automatic reconversion of encoded bytes to strings on receive. For that, there is little reason I can think of to use anything other than utf-8.
Still plenty of other applications that use other encodings (and as I suggested to Paul, the real stress test for any proposed API is using it to call iconv to do an encoding conversion).
With that, the simple utf-8 example just becomes:
  with TextPopen(cmd, stdout=PIPE) as p:
      for line in p.stdout:
          process(line)
Would type(line) be str or bytes?
str, otherwise this wouldn't be any different to the existing Popen behaviour.

Cheers,
Nick.
-- Terry Jan Reedy
Paul Moore <p.f.moore@gmail.com> writes:
On 1 September 2014 21:37, Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:
This brings up a good point: having a single encoding, errors, and newlines set of parameters for Popen and the convenience functions implies that you want to pass the same ones to all pipes. But how often is that true?
My proposal was purely for encoding, and was prompted by the fact that the Windows default encoding does not support all of Unicode. Setting PYTHONIOENCODING to utf-8 for a Python subprocess allows handling of all of Unicode if you can set the subprocess channels' encoding to utf-8. As PYTHONIOENCODING affects all 3 channels, being able to set a single value for all 3 channels is sufficient for that use case.
Setting newline and the error handler were *not* part of my original proposal, essentially because I know of no other way to force a subprocess to use anything other than the default encoding for the standard IO streams. Handling programs that are defined as using the standard streams for anything other than normal text (nul-terminated lines, explicitly defined non-default encodings) isn't something I have any examples of.
The find -print0 example is out of scope, IMO, as newline handling is different from encoding. At some point, it becomes easier to manually wrap the streams rather than having huge numbers of parameters to the Popen constructor.
I'll think some more on this...
PYTHONIOENCODING allows you to specify the error handler, e.g., to avoid exceptions while reading a list of files:

  $ ls | PYTHONIOENCODING=:surrogateescape python3 -c 'import sys; print(list(sys.stdin))'

Or the same but with the TextPopen suggested by Antoine:

  with TextPopen(['ls'], stdout=PIPE, ioencoding=':surrogateescape') as p:
      for filename in p.stdout:
          process(filename)

os.fsencode(filename) would get the original bytes back.

Note: the ioencoding parameter is my interpretation.

--
Akira
On 2014-09-01 20:14, Akira Li wrote:
Paul Moore <p.f.moore@gmail.com> writes:
I propose adding an "encoding" parameter to subprocess.Popen (and the various wrapper routines) to allow specifying the actual encoding to use.
Obviously, you can simply wrap the binary streams yourself - the main use for this facility would be in the higher level functions like check_output and communicate.
Does this seem like a reasonable suggestion?
Could you provide examples of how the final result would look?

For example, to read a utf-8 encoded byte stream as text with universal newline mode enabled:

  with (Popen(cmd, stdout=PIPE, bufsize=1) as p,
        TextIOWrapper(p.stdout, encoding='utf-8') as pipe):
      for line in pipe:
          process(line)
You can parenthesise multiple context managers like that, and, anyway, I think it would be clearer as:

  with Popen(cmd, stdout=PIPE, bufsize=1) as p:
      for line in TextIOWrapper(p.stdout, encoding='utf-8'):
          process(line)
Or the same, all at once:
lines = check_output(cmd).decode('utf-8').splitlines() #XXX issue22232
MRAB <python@mrabarnett.plus.com> writes:
On 2014-09-01 20:14, Akira Li wrote:
Paul Moore <p.f.moore@gmail.com> writes:
I propose adding an "encoding" parameter to subprocess.Popen (and the various wrapper routines) to allow specifying the actual encoding to use.
Obviously, you can simply wrap the binary streams yourself - the main use for this facility would be in the higher level functions like check_output and communicate.
Does this seem like a reasonable suggestion?
Could you provide examples of how the final result would look?

For example, to read a utf-8 encoded byte stream as text with universal newline mode enabled:

  with (Popen(cmd, stdout=PIPE, bufsize=1) as p,
        TextIOWrapper(p.stdout, encoding='utf-8') as pipe):
      for line in pipe:
          process(line)
You can parenthesise multiple context managers like that, and, anyway,

You mean: "can't". I know [1]
[1] https://mail.python.org/pipermail/python-dev/2014-August/135769.html
I think it would be clearer as:
  with Popen(cmd, stdout=PIPE, bufsize=1) as p:
      for line in TextIOWrapper(p.stdout, encoding='utf-8'):
          process(line)
It is a habit of mine to use an explicit with-statement for file-like objects. You are right that it is not necessary in this case, though a with-statement forces file.close() at a known point, so you don't need to consider carefully what happens if closing is left to the garbage collector (if it runs at all) at some indeterminate time in the future.

--
Akira
participants (8)
- Akira Li
- Andrew Barnert
- Antoine Pitrou
- Moritz Beber
- MRAB
- Nick Coghlan
- Paul Moore
- Terry Reedy