[Distutils] PEP 517 - specifying build system in pyproject.toml
Paul Moore
p.f.moore at gmail.com
Tue May 23 07:56:40 EDT 2017
On 23 May 2017 at 12:36, Thomas Kluyver <thomas at kluyver.me.uk> wrote:
> As you described earlier, though, even using a locale dependent encoding
> doesn't really avoid this problem, because of tools using OEM vs ANSI
> codepages on Windows. And if PYTHONIOENCODING is set, Python processes
> will use that over the locale encoding. I think we're ultimately better
> off specifying a consistent encoding rather than trying to guess about
> it.
Agreed, it doesn't avoid the problem - but it does minimise it. I
don't see any huge advantage in having a consistent encoding across
platforms, though. Having a consistent *rule*, yes, but "use the
locale encoding" is such a rule as well.
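To illustrate the rule under discussion (a minimal sketch, not part of any proposal): Python exposes the locale encoding directly, and PYTHONIOENCODING, if set, overrides it for the standard streams - which is why the same rule yields different results on different machines.

```python
import locale
import sys

# The "locale encoding" rule: typically 'UTF-8' on modern Linux/macOS,
# but an ANSI codepage such as 'cp1252' on Western-European Windows.
print("locale preferred encoding:", locale.getpreferredencoding(False))

# PYTHONIOENCODING, when set, takes precedence for stdin/stdout/stderr,
# so a child Python process may not even follow the locale rule.
print("stdout encoding:", sys.stdout.encoding)
```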
> I'm also thinking of all the bugs I've seen (and written) by assuming
> open() in text mode defaults to UTF-8 encoding - as it does on the Linux
> and Mac computers many open source developers use, but not on Windows,
> nor in all Linux configurations.
So based on your proposal, won't you introduce similar bugs by using
print() without sorting out encodings? Unless (see below) you assume
that the frontend sorts it out for you.
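The open() pitfall Thomas mentions can be sketched like this (the filename is made up for illustration):

```python
import tempfile
from pathlib import Path

# A file written as UTF-8 by one tool...
path = Path(tempfile.mkdtemp()) / "notes.txt"
path.write_text("caf\u00e9", encoding="utf-8")

# Buggy: read_text() with no encoding uses the locale encoding, so on a
# cp1252 Windows box the UTF-8 bytes come back as mojibake ('cafÃ©'),
# and with some codecs it raises UnicodeDecodeError instead.
# data = path.read_text()

# Portable: name the encoding explicitly.
data = path.read_text(encoding="utf-8")
assert data == "caf\u00e9"
```

The bug is invisible on a UTF-8 locale, which is exactly why it ships.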
> So I'd recommend that backends running processes for which they know the
> encoding should transcode it to UTF-8. I expect we can make standard
> utility functions to wait for a subprocess to finish while reading,
> transcoding, and repeating its output.
Yes, subprocesses that produce a known encoding are trivial to deal
with - but remembering that you *need* to deal with them is less so.
My concern here is the same one as you quote above: assuming that a
subprocess returns UTF-8 encoded bytes, because it does on Linux and
Mac.
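The kind of utility function Thomas suggests might look roughly like this - a sketch only, with a made-up name; it assumes the backend knows the tool's output encoding and simply re-encodes everything as UTF-8:

```python
import subprocess


def run_transcoding(cmd, tool_encoding):
    """Run cmd, decode its output using the tool's known encoding,
    and return (exit_code, output re-encoded as UTF-8 bytes)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so nothing is missed
    )
    raw = proc.stdout.read()
    code = proc.wait()
    # Decode with the tool's encoding, re-encode as UTF-8 for the frontend.
    return code, raw.decode(tool_encoding).encode("utf-8")


# e.g. a compiler known to emit the OEM codepage on Windows:
# run_transcoding(["some-tool", "--flag"], "cp850")
```

A real helper would stream line by line rather than buffering the whole output, but the transcoding step is the same.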
> I'm still not sure what the backend should do when it runs something for
> which it doesn't know the output encoding. The possibilities are either:
>
> - Take a best guess and transcode it to UTF-8, which may risk losing
> some information, but keeps the output as valid UTF-8
> - Pass through the raw bytes, ensuring that no information is lost, but
> leaving it up to the frontend/user to deal with that.
There's never a good answer here. The "correct" answer is to do
research and establish what encoding the tool uses, but that's often
stupidly difficult.
But if you genuinely don't know (or worse, know there is no consistent
encoding), I'm not sure I see how passing unknown bytes on to the
frontend - which by necessity has less context for guessing what those
bytes might mean - is the right answer. The frontend is better placed
to know what it wants to *do* with those bytes, but "convert them to
text for the user to see" is the only real answer here IMO (sure,
dumping the raw bytes to a file might be an option, but I imagine
it'll be a relatively uncommon choice).
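For what it's worth, the first of Thomas's two options has a one-line expression in Python: decode with a best guess and errors="replace", which loses undecodable bytes (as U+FFFD) but guarantees the frontend always receives valid UTF-8. A sketch (the function name is made up):

```python
def to_utf8_best_effort(raw: bytes, guess: str = "utf-8") -> bytes:
    """Option 1: best-guess decode, substituting U+FFFD for bytes the
    guessed codec can't handle, so the result is always valid UTF-8
    (lossy, but never raises)."""
    return raw.decode(guess, errors="replace").encode("utf-8")


# cp1252 bytes mis-guessed as UTF-8: information is lost, validity kept.
mojibake = "caf\u00e9".encode("cp1252")  # b'caf\xe9'
print(to_utf8_best_effort(mojibake))
```

Option 2 (pass through the raw bytes) needs no code at all - which is part of its appeal, and part of its problem.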
At the end of the day, there is no perfect answer here. Someone is
going to have to make a judgement call, and as the PEP author, I guess
that's you. So at this point I'll stop badgering you and leave it up
to you to decide what the consensus is. Thanks for listening to my
points, though.
Paul