[Distutils] PEP 517 - specifying build system in pyproject.toml

Paul Moore p.f.moore at gmail.com
Sat May 20 05:11:19 EDT 2017


On 20 May 2017 at 09:03, Thomas Kluyver <thomas at kluyver.me.uk> wrote:
> On Sat, May 20, 2017, at 07:54 AM, Nick Coghlan wrote:
>> * on platforms with 8-bit standard streams (e.g. Linux, Mac OS X),
>> build systems SHOULD emit UTF-8 encoded output
>> * on platforms with 16-bit standard streams (e.g. Windows), build
>> systems SHOULD emit UTF-16-LE encoded output
>
> I'm quite prepared to accept that I'm mistaken, but my understanding is
> that *standard streams* are 8-bit on Windows as well. The 16-bit thing
> that Python 3.6 does, as I understand it, is to bypass standard streams
> when it detects that they're connected to a console, and use a Windows
> API call to write text to the console directly as UTF-16.
>
> If so, when stdout/stderr are pipes, which I assume is how pip captures
> the output from build processes, there's no particular reason to send
> UTF-16 data just because it's Windows.

That's my understanding too. The standard streams are still byte
streams with an encoding. It's just that when the final destination is
the console, the underlying IO is done by the Windows Unicode APIs.
Because of this, when the output is the console the stream can accept
any Unicode character, and so an encoding of UTF-8 is specified (and
yes, AIUI there is a translation Unicode string -> UTF-8 bytes ->
Unicode console API). For non-console output, though, the standard
streams are still byte streams and the platform behaviour is
respected, so we use the ANSI codepage (calling this the platform
standard glosses over the fact that there are two standard codepages,
ANSI and OEM, and tools don't always make the same choice when faced
with piped output). Long story short, UTF-16 is irrelevant here.
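
For anyone who wants to check the two codepages on their own machine,
here is a minimal sketch. The function name describe_codepages is just
illustrative; GetACP and GetOEMCP are the Win32 calls that report the
ANSI and OEM code pages, and the Windows branch is skipped elsewhere.

```python
import locale
import sys


def describe_codepages():
    """Return the encodings a piped tool might plausibly pick."""
    # The locale's preferred encoding is what Python uses for pipes by default.
    info = {"preferred": locale.getpreferredencoding(False)}
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # Both calls return a numeric code page identifier, e.g. 1252 or 850.
        info["ansi"] = "cp%d" % kernel32.GetACP()
        info["oem"] = "cp%d" % kernel32.GetOEMCP()
    return info


print(describe_codepages())
```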

The docs for 3.6 say "Under Windows, if the stream is interactive
(that is, if its isatty() method returns True), the console codepage
is used, otherwise the ANSI code page". This is out of date (it was
true for 3.5 and earlier). In 3.6+ utf-8 is used for interactive
streams rather than the console codepage:

>py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)"
utf-8
>py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)" >$null
cp1252
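
The piped case is the one that matters for an install tool capturing
build output. A rough sketch of how to observe it from Python follows;
the CHILD snippet and the function name are made up for illustration,
and the result will differ by platform and locale settings.

```python
import subprocess
import sys

# The child reports its own view of stdout when stdout is a pipe.
CHILD = "import sys; print(sys.stdout.encoding, sys.stdout.isatty())"


def child_stdout_encoding():
    """Run a child interpreter with stdout piped and report what it saw."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        check=True,
    )
    # The reply is plain ASCII, so decoding it is safe under any code page.
    encoding, isatty = result.stdout.decode("ascii").split()
    return encoding, isatty == "True"


print(child_stdout_encoding())
```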

The bigger question, though, is to what extent we want to mandate that
build tools that run external tools such as compilers take
responsibility for the encoding of the output of those tools (rather
than simply passing the output through to the output stream
unmodified). And if we do want to, whether we want to allow an
exception for setuptools/distutils.

Also, a question regarding Unix - do we really want to mandate UTF-8
even if the system locale is set to something else? Won't that mean
that build tools have the same problem on Unix that we already have on
Windows, namely getting compilers to generate output in the encoding
the build tool wants?

My feeling is:

1. Build systems SHOULD emit output encoded in the preferred locale
encoding (normally UTF-8 on Unix, the ANSI code page on Windows).
2. Build systems should ideally check the encoding used by external
tools that they run and transcode to the correct encoding if necessary
- but this is a quality of implementation matter.
3. Install tools MUST NOT fail if build tools produce output with the
wrong encoding, but MUST correctly reproduce build tool output if the
build tools do produce the right encoding.
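
To make points 2 and 3 concrete, here is a minimal sketch, assuming
the preferred locale encoding is the "right" one as in point 1. The
function names transcode and decode_build_output are hypothetical, not
part of pip or any build tool, and backslashreplace is only one
possible way of not failing on bad bytes.

```python
import locale


def transcode(raw, source_encoding, target_encoding=None):
    """Point 2: re-encode external tool output into the expected encoding."""
    if target_encoding is None:
        target_encoding = locale.getpreferredencoding(False)
    text = raw.decode(source_encoding, errors="replace")
    # Characters the target cannot represent are escaped rather than dropped.
    return text.encode(target_encoding, errors="backslashreplace")


def decode_build_output(raw, encoding=None):
    """Point 3: never raise, and reproduce correctly encoded output exactly."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # Undecodable bytes come out as \xNN escapes instead of an exception.
    return raw.decode(encoding, errors="backslashreplace")


good = "a\u00a3bc".encode("utf-8")
bad = b"a\xa3bc"  # not valid UTF-8
print(ascii(decode_build_output(good, "utf-8")))
print(ascii(decode_build_output(bad, "utf-8")))
```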

My biggest concern with this is that I believe that Visual C produces
output in the OEM codepage even when output to a pipe. Actually I just
did some experiments (VS 2015), and it's even worse than that. The
compiler (cl) seems to use the OEM code page when writing to a pipe,
but the linker uses the ANSI code page. This means that a command like
"cl a£bc" produces output on (a piped) stdout that contains mixed
encodings. Given this situation, I think we have to give up and
take the view that the Visual C tools are simply broken in this
regard, and we shouldn't worry about them. I'm therefore inclined
to drop point (2) from the 3 above.
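
To illustrate why mixed encodings are unrecoverable, here is a small
simulation of the "cl a£bc" case. It does not run the Visual C tools;
it assumes cp850 as the OEM code page and cp1252 as the ANSI code page
(typical for a Western European setup, but an assumption) and just
concatenates the two byte sequences as they would appear on one pipe.

```python
TEXT = "a\u00a3bc"  # the "a£bc" argument from the example above

OEM = "cp850"    # assumed OEM code page (what cl appeared to use)
ANSI = "cp1252"  # assumed ANSI code page (what the linker appeared to use)

compiler_bytes = (TEXT + "\n").encode(OEM)
linker_bytes = (TEXT + "\n").encode(ANSI)
stream = compiler_bytes + linker_bytes  # what a single pipe would carry

# Whichever single encoding the reader picks, one of the lines is mangled.
as_ansi = stream.decode(ANSI)
as_oem = stream.decode(OEM)
print(ascii(as_ansi))
print(ascii(as_oem))
```

Neither decoding gets both lines right, and there is no reliable way
for the reader to tell where one encoding stops and the other starts.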

Paul

