[Distutils] PEP 517 - specifying build system in pyproject.toml

Mon May 22 15:53:02 EDT 2017

On 22 May 2017 at 18:38, Steve Dower <steve.dower at python.org> wrote:
> Okay, I think I get the problem now. We expect backends to let child
> subprocesses just spit out whatever *they* want onto the same stdout/stderr.

s/expect/allow/

The paranoid in me suspects "expect" is also true, though :-)

> I'm really not a fan of forcing front ends to clean up that mess, and so I'd
> still suggest that the backend "tool" be a script to launch the actual tool
> and do the conversion to UTF-8.

The script you're describing as the backend "tool" is what the PEP
calls a "shim" (as Nick pointed out to me), and it is considered part
of the front end. The back end is a set of Python APIs that the front
end calls (in any real-life front end, via the front end's shim
script).

> Perhaps the middle ground is to specify encoding='utf-8', errors='anything
> but strict' for front-ends, and well-behaved backends should do the work to
> transcode when it is known to be necessary for the tools they run. (i.e.
> frontends do not crash, backends have a simple rule for avoiding loss of
> data).

For front ends, "never crash" is essential. But "produce as readable
as possible data" is also a high priority. Consider for example a
Russian user with a series of directories named in Russian. If the
tools write an error using his local 8-bit encoding, and the front end
assumes UTF-8, then all of the high-bit characters in his directory
names would be replaced. Deciphering an error message like "File
???????/?????/?????.c: unexpected EOF" is problematic... :-(
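
To make the failure mode concrete, here is a minimal sketch of what
happens to such a message. The path and the choice of cp1251 as the
user's local 8-bit encoding are assumptions made purely for
illustration; nothing here is taken from pip or any backend.

```python
# Hypothetical error message containing Russian directory names.
message = "File проект/исходники/модуль.c: unexpected EOF"

# What a build tool on a cp1251 system might write to its stderr.
raw = message.encode("cp1251")

# Front end assumes UTF-8 but must not crash, so it replaces bad bytes.
as_utf8 = raw.decode("utf-8", errors="replace")

# Front end decodes with the encoding the tool actually used.
as_local = raw.decode("cp1251")
```

With the UTF-8 assumption the ASCII parts survive but every Cyrillic
letter turns into the U+FFFD replacement character, whereas decoding
with the tool's own encoding round-trips the message intact.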

The PEP's model assumes that most front ends will call the backend via
a subprocess "shim" maintained by the front end project. But
the expectation here seems to be that the backend is allowed to write
directly to the stdio streams of its process (or at least, to let the
tools it calls do so). So the shim *cannot* control the encoding of
the data received by the frontend, and so the encoding has to be
agreed between backend and frontend. The basic question is how the
responsibility for dealing with data in an uncertain encoding is
allocated.
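
A rough sketch of why the shim can't fix this on its own. The child
script below is a stand-in for a backend, and the cp1251 bytes stand in
for output from a tool that writes straight to the inherited stream;
both are assumptions for illustration. The shim asks for UTF-8 via
PYTHONIOENCODING, but that only governs Python's text layer in the
child, not raw bytes written to the same file descriptor.

```python
import os
import subprocess
import sys

# Stand-in backend: one line goes through Python's text layer, one is
# written as raw cp1251 bytes, as an external build tool might do.
CHILD = r'''
import sys
sys.stdout.write("\u0444\u0430\u0439\u043b\n")
sys.stdout.flush()
sys.stdout.buffer.write("\u0444\u0430\u0439\u043b\n".encode("cp1251"))
sys.stdout.buffer.flush()
'''


def run_backend_via_shim():
    """Run the stand-in backend and return the raw bytes it produced."""
    # The shim requests UTF-8 for the child's own text streams.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return proc.stdout
```

The bytes that come back contain the same word in two encodings, and
nothing in the stream tells the front end which part is which.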

It seems to me there are two schools of thought:

1. There are likely to be fewer front ends than back ends, and so the
front end(s) (basically, pip) should deal with the problem. Also,
backends are more likely to be written by developers who are looking
at very specific scenarios, and asking them to handle all the
complexities of robust multilingual coding is raising the bar on
writing a backend too high.

2. The backend is where the problem lies, and so the backend should
address the issue. Furthermore, a well-established principle in
dealing with encodings is to convert to strings right at the boundary
of the application, and in this case the backend is the only code that
has access to that boundary.

(I tend towards (2), but I honestly can't say to what extent that's
because it makes it "someone else's problem" for me ;-))

As you say, the middle ground here is that front ends must never
crash, and back ends should (but aren't required to) produce output in
a specified encoding (I still prefer the locale encoding as that has
the best chance of avoiding the ????/???? issue). That's more or less
what pip has to deal with now (and not that far off (1)), and my
current attempt to address that situation is at
https://github.com/pypa/pip/pull/4486 for what it's worth.
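
For illustration only, one way a front end could meet the "never
crash, stay as readable as possible" requirement is sketched below.
The function name is hypothetical and this is not a description of
what the pull request above does; it simply assumes the locale
encoding as the default and escapes any bytes that don't decode.

```python
import locale
from typing import Optional


def decode_build_output(data: bytes, expected: Optional[str] = None) -> str:
    """Decode backend output without ever raising UnicodeDecodeError.

    `expected` is the encoding agreed with the backend, if any;
    otherwise the locale's preferred encoding is assumed.
    """
    encoding = expected or locale.getpreferredencoding(False)
    # Undecodable bytes are kept visible as \xNN escapes rather than
    # being dropped or replaced with '?'.
    return data.decode(encoding, errors="backslashreplace")
```

Using "backslashreplace" rather than "replace" is a design choice: the
output is uglier, but the original bytes can still be recovered from
the log when someone needs to work out what the tool really said.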

A couple of final thoughts. I would expect that testing the handling
of encodings is likely to be an important issue (at least, I expect
there'll be bugs, and adding tests to make sure they get properly
fixed will be important). Handling tool output encoding in the backend
is likely to involve fairly low-level interface functions, whose
inputs and outputs can be mocked easily. So I would expect backend
unit testing of encoding handling to be reasonably straightforward.
Conversely, testing front end handling of encoding
issues is very tricky - it's necessary to set up system state to
persuade the build tools to produce the data you want to test against
(it feels like integration testing rather than unit testing). Also,
fixing encoding issues in the backend decouples the fix from pip's
release cycle, which is probably a good thing (unless the backend is
not well maintained, but that's an issue in itself).
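
As a sketch of the kind of backend unit test I have in mind: the
run_tool helper, the tool name, and both encodings below are
hypothetical, chosen only to show that the tool invocation can be
mocked so no real compiler or system state is needed.

```python
import subprocess
from unittest import mock

TOOL_ENCODING = "cp1251"   # assumption: what the hypothetical tool emits
AGREED_ENCODING = "utf-8"  # assumption: what backend and front end agreed on


def run_tool(cmd):
    """Run a build tool and return its output in the agreed encoding."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Decode at the boundary using the tool's known encoding...
    text = proc.stdout.decode(TOOL_ENCODING, errors="backslashreplace")
    # ...and re-encode in the encoding the front end expects.
    return text.encode(AGREED_ENCODING)


def test_run_tool_transcodes():
    message = "\u0444\u0430\u0439\u043b.c: unexpected EOF"
    fake = mock.Mock(stdout=message.encode(TOOL_ENCODING))
    # Replace the real subprocess call with canned tool output.
    with mock.patch("subprocess.run", return_value=fake) as run:
        out = run_tool(["hypothetical-compiler", "x.c"])
    run.assert_called_once()
    assert out.decode(AGREED_ENCODING) == message
```

The equivalent front end test would need a real tool that can be
coaxed into emitting output in a particular encoding, which is the
integration-style setup described above.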

Paul