[Distutils] PEP 517 - specifying build system in pyproject.toml

Tue May 23 08:41:49 EDT 2017

On Tue, May 23, 2017, at 12:56 PM, Paul Moore wrote:
> So based on your proposal, won't you introduce similar bugs by using
> print() without sorting out encodings? Unless (see below) you assume
> that the frontend sorts it out for you.

If you strictly follow the locale encoding, you need to sort it out in
Python anyway, in case the stdout encoding has been overridden by
PYTHONIOENCODING, or PYTHONSTARTUP, or the infernal .pth files. I accept
that those are corner cases, though.
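
To illustrate the PYTHONIOENCODING case, here is a rough sketch. The
helper name child_stdout_encoding is made up for this example. It starts
a child Python with the variable set and reports which encoding the
child's stdout ends up using, normalised through codecs.lookup() so the
spelling is predictable:

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(env_overrides):
    """Return the normalised stdout encoding of a child Python process.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, **env_overrides)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Encoding names are ASCII, so decoding the report itself is safe.
    name = out.decode("ascii").strip()
    return codecs.lookup(name).name


# The child's stdout follows the override, whatever the locale says.
overridden = child_stdout_encoding({"PYTHONIOENCODING": "latin-1"})
print(overridden)
```

So code that assumes "stdout is in the locale encoding" can be wrong
even when nothing about the locale itself has changed.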

> Yes, subprocesses that produce a known encoding are trivial to deal
> with. But remembering that you *need* to deal with them less so. My
> concern here is the same one as you quote above - assuming that
> subprocess returns UTF-8 encoded bytes, because it does on Linux and
> Mac.

I agree, that is a concern.

> But if you genuinely don't know (or worse, know there is no consistent
> encoding) I'm not sure I see how passing unknown bytes onto the
> frontend, which by necessity has less context to guess what those
> bytes might mean, is the right answer. The frontend is better able to
> know what it wants to *do* with those bytes, but "convert them to text
> for the user to see" is the only real answer here IMO (sure, dumping
> the raw bytes to a file might be an option, but I imagine it'll be a
> relatively uncommon choice).

I was indeed thinking of dumping them to a file. It's not very
user-friendly, but it means the information is there if you need it. I
suspect that regardless of the locale, technical information like code
and filesystem paths will often contain enough ASCII that a human can
interpret them even if non-ASCII characters are wrongly encoded. So I
hope that needing to reverse-engineer the encoding will be relatively
rare.

The appeal of passing the bytes through untouched is that it follows "in
the face of ambiguity, refuse the temptation to guess". If the backend
guesses the encoding incorrectly, the frontend gets valid UTF-8, but is
no better able to display it meaningfully, and you then need to go
through decode-encode-decode to recover the original text, even if no
data was lost.
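
As a small worked example of that decode-encode-decode round trip,
assume (purely for illustration) that a subprocess wrote cp1251 and the
backend wrongly guessed latin-1 before re-encoding to UTF-8:

```python
# "Privet" (hello) in Cyrillic, written with escapes to keep this ASCII.
original_text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# What the subprocess actually emitted (assumed encoding: cp1251).
raw = original_text.encode("cp1251")

# Backend guesses latin-1, which accepts any byte, then sends UTF-8.
sent = raw.decode("latin-1").encode("utf-8")

# Frontend: the bytes are valid UTF-8, but the text is mojibake.
mojibake = sent.decode("utf-8")

# Recovery needs both the wrong guess and the real encoding to be known.
recovered = mojibake.encode("latin-1").decode("cp1251")

print(mojibake != original_text, recovered == original_text)
```

No data is lost here only because latin-1 maps every byte to a
character; a guess that rejects or replaces bytes would not be
reversible.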

Another option: if the backend runs a subprocess with unknown output
encoding, it redirects that output to a temp file and prints the path in
its own output. Then there's a better chance that the unknown encoding
is at least consistent within the file, so tools can do encoding
detection on it.
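
A minimal sketch of that temp file idea, using only the standard
library. The function name run_capturing_to_file and the file name
prefix are invented for this example, not anything the PEP specifies:

```python
import subprocess
import sys
import tempfile


def run_capturing_to_file(cmd):
    """Run cmd, sending its stdout and stderr to a temp file as raw bytes.

    Hypothetical helper: returns (returncode, path_to_log).
    """
    with tempfile.NamedTemporaryFile(
        prefix="build-output-", suffix=".log", delete=False
    ) as log:
        returncode = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
    # Only the path goes into the backend's own (known-encoding) output.
    print("Subprocess output (unknown encoding) saved to: " + log.name)
    return returncode, log.name


# Stand-in for a tool with unknown encoding: emits a lone latin-1 byte.
cmd = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'caf\\xe9\\n')"]
returncode, log_path = run_capturing_to_file(cmd)
```

The bytes land in the file untouched, so nothing is decoded until a
person or a detection tool looks at it.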

> At the end of the day, there is no perfect answer here. Someone is
> going to have to make a judgement call, and as the PEP author, I guess
> that's you. So at this point I'll stop badgering you and leave it up
> to you to decide what the consensus is. Thanks for listening to my
> points, though.

I know what I think, but I don't feel like there's a consensus as yet.

Can I take a quick poll of what people following this topic think?

Q1: Default encoding for captured build stdout/stderr
a. UTF-8 (consistent, can represent any character)
b. Locale default (convenient if backend runs subprocesses which produce
output in the locale encoding)

Q2: Handling unknown encodings from subprocesses
a. Backend should ensure all output is valid in the target encoding
(Q1), though it may not be accurate.
b. Unknown output may be passed on as bytes without transcoding, so the
frontend can e.g. dump it to a file.
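
To make option 2a concrete, here is one way a backend might force
unknown bytes into valid UTF-8. This is only a sketch: the function name
to_valid_utf8 and the choice of the "replace" error handler are my
assumptions, not something the PEP prescribes:

```python
def to_valid_utf8(raw, guessed_encoding="utf-8"):
    """Decode with a guess, replacing undecodable bytes, then emit UTF-8.

    The result is always valid UTF-8, but may not match the original text.
    """
    text = raw.decode(guessed_encoding, errors="replace")
    return text.encode("utf-8")


# A latin-1 byte misread as UTF-8 becomes U+FFFD: valid, but inaccurate.
result = to_valid_utf8(b"caf\xe9")
print(result.decode("utf-8"))
```

Under 2b the same bytes would instead reach the frontend unchanged.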

I'm currently 1:a, 2:?a.

Thomas