[Distutils] PEP 517 - specifying build system in pyproject.toml

Thomas Kluyver thomas at kluyver.me.uk
Tue May 23 07:36:22 EDT 2017


On Tue, May 23, 2017, at 11:04 AM, Paul Moore wrote:
> However, if we do this then we have a situation where existing build
> tools (compilers, etc) that we have to support still use platform
> dependent encodings. That's a reality that we can't wish away. And the
> majority of real-life issues reported on pip are with compilation
> errors. So do we require backends that run these tools to ensure that
> they transcode the output, or do we risk significant output
> corruption, because (essentially) every high-bit character in the
> compiler output will be replaced as it's invalid UTF-8?

As you described earlier, though, even using a locale-dependent encoding
doesn't really avoid this problem, because tools on Windows variously
use the OEM and ANSI codepages. And if PYTHONIOENCODING is set, Python
processes will use that rather than the locale encoding. I think we're
ultimately better off specifying one consistent encoding than trying to
guess at it.

I'm also thinking of all the bugs I've seen (and written) caused by
assuming that open() in text mode defaults to UTF-8 - as it does on the
Linux and Mac computers many open source developers use, but not on
Windows, nor in all Linux configurations.

So I'd recommend that when a backend runs a tool whose output encoding
it knows, it should transcode that output to UTF-8. I expect we can
provide standard utility functions that wait for a subprocess to finish
while reading its output, transcoding it, and relaying it on.

I'm still not sure what the backend should do when it runs something
whose output encoding it doesn't know. There are two possibilities
(sketched in code after this list):

- Take a best guess and transcode it to UTF-8, which may risk losing
some information, but keeps the output as valid UTF-8
- Pass through the raw bytes, ensuring that no information is lost, but
leaving it up to the frontend/user to deal with that.
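Concretely, the two options might look like this (another rough sketch;
the command and the use of the locale encoding as the "best guess" are
illustrative assumptions only):

    import locale
    import subprocess
    import sys

    cmd = ['some-build-tool']  # hypothetical command
    raw = subprocess.run(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT).stdout

    # Option 1: best-guess transcode. The output stays valid UTF-8,
    # but any bytes the guessed codec can't decode are replaced.
    text = raw.decode(locale.getpreferredencoding(False),
                      errors='replace')
    sys.stdout.buffer.write(text.encode('utf-8'))

    # Option 2: pass the raw bytes through untouched. Nothing is
    # lost, but the frontend/user must cope with unknown encodings.
    sys.stdout.buffer.write(raw)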

Thomas

