command line versus python API for build system abstraction (was Re: build system abstraction PEP)
On Sun, Nov 8, 2015 at 5:28 PM, Robert Collins wrote:

The use of a command line API rather than a Python API is a little contentious. Fundamentally anything can be made to work, and Robert wants to pick something that's sufficiently lowest-common-denominator that implementation is straightforward on all sides. Picking a CLI for that makes sense because all build systems will need a CLI for end users to use anyway.
I agree that this is not terribly important, and anything can be made to work. Having pondered it all for a few more weeks, though, I think that the "entrypoints-style" interface actually is unambiguously better, so let me see about making that case.

What's at stake?
----------------------

Option 1, as in Robert's PEP: The build configuration file contains a string like "flit --dump-build-description" (or whatever), which names a command to run, and a protocol for running this command to get information on the actual build system interface. Build operations are performed by executing these commands as subprocesses.

Option 2, my preference: The build configuration file contains a string like "flit:build_system_api" (or whatever), which names a Python object accessed like

    import flit
    flit.build_system_api

(This is the same syntax used for naming entry points.) This object would then have attributes and methods describing the actual build system interface. Build operations are performed by calling these methods.

Why does it matter?
----------------------------

First, to be clear: I think that no matter which choice we make here, the final actual execution path is going to end up looking very similar. Even if we go with the entry-point-style Python hooks, build frontends like pip will still want to spawn a child to do the actual calls -- this is important for isolating pip from the build backend and the build backend from pip, and because the build backend needs to execute in a different environment than pip itself. So no matter what, we're going to have some subprocess calls and some IPC.

The difference is that in the subprocess approach, the IPC machinery is all written into the spec, and build frontends like pip implement one half while build backends implement the other half. In the Python API approach, the spec just specifies the Python calling conventions, and both halves of the IPC code live inside each build frontend.
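For concreteness, the "flit:build_system_api" lookup described above can be done with a few lines of stdlib code. This is just a sketch; the helper name is illustrative, and it assumes the string always contains a colon:

```python
import importlib

def resolve_build_system(spec):
    """Resolve an entry-point-style string such as 'flit:build_system_api'
    into the Python object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The part after the colon may be a dotted attribute path.
    for attr in attr_path.split("."):
        obj = getattr(obj, attr)
    return obj
```

This is the same resolution rule that entry points use, which is part of the appeal: tooling and developers already know how to read such strings.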
Concretely, the way I imagine this would work is that pip would set up the build environment, and then run

    build-environment/bin/python path/to/pip-worker-script.py <args>

where pip-worker-script.py is distributed as part of pip. (In simple cases it could simply be a file inside pip's package directory; if we want to support execution from pip-inside-a-zip-file then we need a bit of code to unpack it to a tempfile before executing it. Creating a tempfile is not a huge additional burden given that by the time we call build hooks we will have already created a whole temporary Python environment...)

In the subprocess approach, we have to write a ton of text describing all the intricacies of IPC. We have to specify how the command line gets split (or is it passed to the shell?), specify a JSON-based protocol, specify what happens to stdin/stdout/stderr, etc. In the Python API approach, we still have to do all the work of figuring these things out, but the answers would live inside pip's code instead of in a PEP. The actual PEP text would be much smaller.

It's not clear which approach leads to smaller code overall. If there are F frontends and B backends, then in the subprocess approach we collectively have to write F+B pieces of IPC code, and in the Python API approach we collectively have to write 2*F pieces of IPC code. So on this metric the Python API is a win if F < B, which would happen if e.g. everyone ends up using pip for their frontend but with lots of different backends. That seems plausible, but who knows.

But now suppose that there's some bug in that complicated IPC protocol (which I would rate as about a 99.3% likelihood in our first attempt, because cross-platform, cross-process IPC is super annoying and fiddly). In the subprocess approach, fixing this means that we need to (a) write a PEP, and then (b) fix F+B pieces of code simultaneously on some flag day, and possibly test F*B combinations for correct interoperation.
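The invocation described above is straightforward to express in code. A minimal sketch, assuming a POSIX-style environment layout and illustrative names (pip does not actually ship a helper with these names):

```python
import os
import subprocess

def worker_command(env_dir, worker_script, args):
    """Build the argv for running the frontend's worker script with the
    *build environment's* interpreter, not the frontend's own."""
    python = os.path.join(env_dir, "bin", "python")
    return [python, worker_script] + list(args)

def run_in_build_env(env_dir, worker_script, args):
    # Spawn the worker inside the isolated build environment.
    return subprocess.call(worker_command(env_dir, worker_script, args))
```

The key design point is that only the interpreter path crosses the environment boundary; everything else the worker needs travels as ordinary arguments.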
In the Python API approach, fixing this means patching whichever frontend has the bug -- no PEPs or flag days necessary.

In addition, the ability to evolve the two halves of the IPC channel together allows for better efficiency. For example, in Robert's current PEP there's some machinery that hopes to let pip cache the result of the "--dump-build-description" call. This is needed because in the subprocess approach, the minimum number of subprocess calls needed to do something is two: one to ask what command to call, and a second to actually execute the command. In the Python API approach, you can just go ahead and spawn a subprocess that knows what method it wants to call, and it can locate that method and then call it in a single shot, thus avoiding the need for an error-prone caching scheme.

The flexibility also helps in the face of future changes. Suppose that we start out with a do_build hook, and then later add a do_build2 hook that takes an extra argument or something, and pip wants to call do_build2 if it exists and fall back on do_build otherwise. In the subprocess approach, you have to get the build description, check which hooks are provided, and then, once you've decided which one you want to call, spawn a second subprocess to do that. In the Python API approach, pip can move this fallback logic directly into its hook-calling worker (if it wants to), so it still avoids the extra subprocess call.

Finally, I think it probably is nicer for pip to bite the bullet and take on more of the complexity budget here in order to make things simpler for build backends, because pip is already a highly complex project that undergoes lots of scrutiny from experts, which is almost certainly not going to be as true for all build backends. And the Python API approach is dead simple to explain and implement on the build backend side.
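The do_build/do_build2 fallback described above would be a few lines inside the frontend's worker. A sketch, using the hypothetical hook names from the example (neither hook exists in any real spec):

```python
def call_build_hook(backend, wheel_dir, config):
    """Prefer a newer do_build2 hook (which takes an extra config
    argument) when the backend provides it; otherwise fall back to the
    original do_build. This logic lives in the worker, so no extra
    subprocess round-trip is needed to discover which hooks exist."""
    hook = getattr(backend, "do_build2", None)
    if hook is not None:
        return hook(wheel_dir, config)
    return backend.do_build(wheel_dir)
```

Because this negotiation happens in-process, upgrading it later is a frontend patch rather than a protocol change.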
I understand that the pip devs who are reading this might disagree, which is why I also wrote down the (IMO) much more compelling arguments above :-). But hey, still worth mentioning...

-n

--
Nathaniel J. Smith -- http://vorpus.org
On 10 November 2015 at 15:03, Nathaniel Smith wrote:

On Sun, Nov 8, 2015 at 5:28 PM, Robert Collins wrote:

The use of a command line API rather than a Python API is a little contentious. Fundamentally anything can be made to work, and Robert wants to pick something that's sufficiently lowest-common-denominator that implementation is straightforward on all sides. Picking a CLI for that makes sense because all build systems will need a CLI for end users to use anyway.
I agree that this is not terribly important, and anything can be made to work. Having pondered it all for a few more weeks though I think that the "entrypoints-style" interface actually is unambiguously better, so let me see about making that case.
What's at stake? ----------------------
Option 1, as in Robert's PEP:
The build configuration file contains a string like "flit --dump-build-description" (or whatever), which names a command to run, and then a protocol for running this command to get information on the actual build system interface. Build operations are performed by executing these commands as subprocesses.
Option 2, my preference:
The build configuration file contains a string like "flit:build_system_api" (or whatever) which names a Python object accessed like
import flit
flit.build_system_api
(This is the same syntax used for naming entry points.) Which would then have attributes and methods describing the actual build system interface. Build operations are performed by calling these methods.
Option 3, expressed by Donald on IRC (and implied by his 'smaller step' email): hard-code the CLI.
A compromise position from 'setup.py
Because even if we go with the entry-point-style Python hooks, the build frontends like pip will still want to spawn a child to do the actual calls -- this is important for isolating pip from the build backend and the build backend from pip, it's important because the build backend needs to execute in a different environment than pip itself, etc.
[...]
Concretely, the way I imagine this would work is that pip would set up the build environment, and then it would run
build-environment/bin/python path/to/pip-worker-script.py <args>
fwiw, such a worker is what I was describing in an earlier thread with Robert last week: https://mail.python.org/pipermail/distutils-sig/2015-October/027443.html -- although I wasn't arguing for it in that context, but rather just using it to be clear that a Python API approach could still be used with build environment isolation.
On 10 November 2015 at 04:03, Marcus Smith wrote:
although I wasn't arguing for it in that context, but rather just using it to be clear that a python api approach could still be used with build environment isolation
Which is a good point - it's easy enough to write adapters from one convention to another (I'm inclined to think it's easier to adapt a Python API to a CLI interface than the other way around, but I may be wrong about that).

Paul
On Mon, Nov 9, 2015 at 6:11 PM, Robert Collins wrote:

On 10 November 2015 at 15:03, Nathaniel Smith wrote:

On Sun, Nov 8, 2015 at 5:28 PM, Robert Collins wrote:

The use of a command line API rather than a Python API is a little contentious. Fundamentally anything can be made to work, and Robert wants to pick something that's sufficiently lowest-common-denominator that implementation is straightforward on all sides. Picking a CLI for that makes sense because all build systems will need a CLI for end users to use anyway.
I agree that this is not terribly important, and anything can be made to work. Having pondered it all for a few more weeks though I think that the "entrypoints-style" interface actually is unambiguously better, so let me see about making that case.
What's at stake? ----------------------
Option 1, as in Robert's PEP:
The build configuration file contains a string like "flit --dump-build-description" (or whatever), which names a command to run, and then a protocol for running this command to get information on the actual build system interface. Build operations are performed by executing these commands as subprocesses.
Option 2, my preference:
The build configuration file contains a string like "flit:build_system_api" (or whatever) which names a Python object accessed like
import flit
flit.build_system_api
(This is the same syntax used for naming entry points.) Which would then have attributes and methods describing the actual build system interface. Build operations are performed by calling these methods.
Option 3 expressed by Donald on IRC
Where is this IRC channel, btw? :-)
(and implied by his 'smaller step' email - hard-code the CLI).
A compromise position from 'setup.py
So this would give up on having schema versioning for the API, I guess?
I plan on using that approach in my next draft.
Your point about bugs etc is interesting, but the use of stdin etc in a dedicated Python API also needs to be specified.
Yes, but this specification is trivial: "Stdin is unspecified, and stdout/stderr can be used for printing status messages, errors, etc., just like you're used to from every other build system in the world."

Similarly, we still have to specify what the different operations are, what arguments they take, how they signal errors, etc. The point, though, is that this specification will be shorter and simpler if we're specifying Python APIs than if we're specifying IPC APIs, because with a Python API we get to assume the existence of things like data structures and kwargs and exceptions and return values instead of having to build them from scratch.

-n

--
Nathaniel J. Smith -- http://vorpus.org
On 11 November 2015 at 08:44, Nathaniel Smith wrote:

Similarly, we still have to specify what the different operations are, what arguments they take, how they signal errors, etc. The point, though, is that this specification will be shorter and simpler if we're specifying Python APIs than if we're specifying IPC APIs, because with a Python API we get to assume the existence of things like data structures and kwargs and exceptions and return values instead of having to build them from scratch.
I think the potentially improved quality of error handling arising from a Python-API-based approach is well worth taking into account. When the backend interface is CLI based, you're limited to:

1. The return code
2. Typically unstructured stderr output

This isn't like HTTP+JSON, where there's an existing rich suite of well-defined error codes to use, and an ability to readily include error details in the reply payload.

The other thing is that if the core interface is Python API based, then if no hook is specified, there can be a default provider in pip that knows how to invoke the setup.py CLI (or perhaps even implements looking up the CLI to invoke from the source tree metadata).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
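The default-provider idea in the last paragraph could be sketched as an ordinary backend object that delegates to the legacy CLI. The class and method names below are purely illustrative, not from any PEP or from pip:

```python
import subprocess
import sys

class SetupPyBackend:
    """Hypothetical default provider: presents a Python build API by
    shelling out to the legacy setup.py CLI when a project declares no
    backend of its own."""

    def __init__(self, source_dir):
        self.source_dir = source_dir

    def _command(self, *setup_py_args):
        # Run setup.py with the current (build environment's) interpreter.
        return [sys.executable, "setup.py"] + list(setup_py_args)

    def build_wheel(self, wheel_dir):
        # Delegate to the one interface every legacy project already has.
        subprocess.check_call(self._command("bdist_wheel", "-d", wheel_dir),
                              cwd=self.source_dir)
```

The point is that the fallback lives behind the same Python interface as any other backend, so frontends need no special-casing for legacy projects.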
On 11 November 2015 at 18:53, Nick Coghlan wrote:

[...]

I think the potentially improved quality of error handling arising from a Python-API-based approach is well worth taking into account. When the backend interface is CLI based, you're limited to:
1. The return code
2. Typically unstructured stderr output
This isn't like HTTP+JSON, where there's an existing rich suite of well-defined error codes to use, and an ability to readily include error details in the reply payload.
The other thing is that if the core interface is Python API based, then if no hook is specified, there can be a default provider in pip that knows how to invoke the setup.py CLI (or perhaps even implements looking up the CLI to invoke from the source tree metadata).
It's richer, which is both a positive and a negative. I appreciate the arguments, but I'm not convinced at this point.

pip is going to be invoking a CLI *no matter what*. That's a hard requirement unless Python's very fundamental import behaviour changes. Slapping a Python API on things is lipstick on a pig here IMO: we're going to have to downgrade any richer interface; and by specifying the actual LCD as the interface, it is then amenable to direct exploration by users without them having to reverse-engineer an undocumented thunk within pip.
-Rob
--
Robert Collins
On 11 November 2015 at 16:19, Robert Collins wrote:

On 11 November 2015 at 18:53, Nick Coghlan wrote:

[...] I think the potentially improved quality of error handling arising from a Python-API-based approach is well worth taking into account. When the backend interface is CLI based, you're limited to:
1. The return code
2. Typically unstructured stderr output
This isn't like HTTP+JSON, where there's an existing rich suite of well-defined error codes to use, and an ability to readily include error details in the reply payload.
The other thing is that if the core interface is Python API based, then if no hook is specified, there can be a default provider in pip that knows how to invoke the setup.py CLI (or perhaps even implements looking up the CLI to invoke from the source tree metadata).
It's richer, which is both a positive and a negative. I appreciate the arguments, but I'm not convinced at this point.
pip is going to be invoking a CLI *no matter what*. That's a hard requirement unless Python's very fundamental import behaviour changes. Slapping a Python API on things is lipstick on a pig here IMO: we're going to have to downgrade any richer interface; and by specifying the actual LCD as the interface, it is then amenable to direct exploration by users without them having to reverse-engineer an undocumented thunk within pip.
I'm not opposed to documenting how pip talks to its worker CLI - I just share Nathaniel's concerns about locking that down in a PEP vs keeping *that* CLI within pip's boundary of responsibilities, and having a documented Python interface used for invoking build systems.

However, I've now realised that we're not constrained even if we start with the CLI interface, as there's still a migration path to a Python-API-based model:

Now: documented CLI for invoking build systems
Future: documented Python API for invoking build systems, default fallback invokes the documented CLI

So the CLI documented in the PEP isn't *necessarily* going to be the one used by pip to communicate into the build environment - it may be invoked locally within the build environment.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 11 November 2015 at 19:49, Nick Coghlan wrote:

[...]

I'm not opposed to documenting how pip talks to its worker CLI - I just share Nathaniel's concerns about locking that down in a PEP vs keeping *that* CLI within pip's boundary of responsibilities, and having a documented Python interface used for invoking build systems.
I'm also very wary of something that would be an attractive nuisance. I've seen nothing suggesting that a Python API would be anything but:

- it won't be usable in the general case [it requires the glue to set up an isolated context, which is buried in pip]
- no matter what we do, pip can't benefit from it beyond the subprocess interface pip needs, because pip *cannot* import and use the build interface

tl;dr - I think making the case that the layer we define should be a Python protocol rather than a subprocess protocol requires some really strong evidence. We're *not* dealing with the same moving parts that typical Python stuff requires.
However, I've now realised that we're not constrained even if we start with the CLI interface, as there's still a migration path to a Python API based model:
Now: documented CLI for invoking build systems
Future: documented Python API for invoking build systems, default fallback invokes the documented CLI
Or we just issue an updated bootstrap schema, and there's no fallback or anything needed.
So the CLI documented in the PEP isn't *necessarily* going to be the one used by pip to communicate into the build environment - it may be invoked locally within the build environment.
No, it totally will be. Exactly as setup.py is today. That's deliberate: the *new* thing we're setting out to enable is abstract build systems, not reengineering pip.
The future - sure, someone can write a new thing, and the necessary capability we're building in to allow future changes will allow a new PEP to slot in easily and take on that [non-trivial and substantial chunk of work]. (For instance, how do you do compiler- and build-system-specific options when you have a CLI to talk to pip with?)
-Rob
--
Robert Collins
On Tue, Nov 10, 2015 at 11:27 PM, Robert Collins wrote:

On 11 November 2015 at 19:49, Nick Coghlan wrote:

[...]

I'm not opposed to documenting how pip talks to its worker CLI - I just share Nathaniel's concerns about locking that down in a PEP vs keeping *that* CLI within pip's boundary of responsibilities, and having a documented Python interface used for invoking build systems.
I'm also very wary of something that would be an attractive nuisance. I've seen nothing suggesting that a Python API would be anything but:

- it won't be usable in the general case [it requires the glue to set up an isolated context, which is buried in pip]
This is exactly as true of a command-line API -- in the general case it also requires the glue to set up an isolated context. People who go ahead and run 'flit' from their global environment instead of in the isolated build environment will experience exactly the same problems as people who go ahead and import 'flit.build_system_api' in their global environment, so I don't see how one is any more of an attractive nuisance than the other?

AFAICT the main difference is that "setting up a specified Python context and then importing something and exploring its API" is literally what I do all day as a Python developer. Either way you have to set stuff up, and then once you do, in the Python API case you get stuff like tab completion, ipython introspection (? and ??), etc. for free.
- no matter what we do, pip can't benefit from it beyond the subprocess interface pip needs, because pip *cannot* import and use the build interface
Not sure what you mean by "benefit" here. At best this is an argument that the two options have similar capabilities, in which case I would argue that we should choose the one that leads to simpler and thus more probably bug-free specification language.

But even this isn't really true -- the difference between them is that either way you have a subprocess API, but with a Python API, the subprocess interface that pip uses has the option of being improved incrementally over time -- including, potentially, to take further advantage of the underlying richness of the Python semantics. Sure, maybe the first release would just take all exceptions and map them into some text printed to stderr and a non-zero return code, and that's all that pip would get. But if someone had an idea for how pip could do better than this by, I dunno, encoding some structured metadata about the particular exception that occurred and passing this back up to pip to do something intelligent with it, they absolutely could write the code and submit a PR to pip, without having to write a new PEP.
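The structured-metadata idea could look something like this on the worker side. A sketch only; the field names and helper name are invented for illustration:

```python
import json
import traceback

def run_hook_reporting_errors(hook, args):
    """Call a build hook and report the outcome as a single JSON object,
    so the frontend can act on structured error metadata instead of
    scraping free-form stderr."""
    try:
        result = hook(*args)
    except Exception as exc:
        return json.dumps({
            "status": "error",
            "exception_type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        })
    return json.dumps({"status": "ok", "result": result})
```

Because this lives inside the frontend's worker, enriching the error payload later is an ordinary code change rather than a protocol revision.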
tl;dr - I think making the case that the layer we define should be a Python protocol rather than a subprocess protocol requires some really strong evidence. We're *not* dealing with the same moving parts that typical Python stuff requires.
I'm very confused and honestly do not understand what you find attractive about the subprocess protocol approach. Even your arguments above aren't really trying to be arguments that it's good, just arguments that the Python API approach isn't much better. I'm sure there is some reason you like it, and you might even have said it but I missed it because I disagreed or something :-). But literally the only reason I can think of right now for why one would prefer the subprocess approach is that it lets one remove 50 lines of "worker process" code from pip and move them into the individual build backends instead, which I guess is a win if one is focused narrowly on pip itself. But surely there is more I'm missing?

(And even this lines-of-code argument is actually pretty dubious -- right now your draft PEP is importing-by-reference an entire existing codebase (!) for shell variable expansion in command lines, which is code that simply doesn't need to exist in the Python API approach. I'd be willing to bet that your approach requires more code in pip than mine :-).)
However, I've now realised that we're not constrained even if we start with the CLI interface, as there's still a migration path to a Python API based model:
Now: documented CLI for invoking build systems
Future: documented Python API for invoking build systems, default fallback invokes the documented CLI
Or we just issue an updated bootstrap schema, and there's no fallback or anything needed.
Oh no! But this totally gives up the most brilliant part of your original idea! :-)

In my original draft, I had each hook specified separately in the bootstrap file, e.g. (super schematically):

    build-requirements = flit-build-requirements
    do-wheel-build = flit-do-wheel-build
    do-editable-build = flit-do-editable-build

and you counterproposed that instead there should just be one line like

    build-system = flit-build-system

and this is exactly right, because it means that if some new capability is added to the spec (e.g. a new hook -- like, hypothetically, imagine if we ended up deferring the equivalent of egg-info or editable-build-mode to v2), then the new capability just needs to be implemented in pip and in flit, and then all the projects that use flit immediately gain superpowers without anyone having to go around and manually change all the bootstrap files in every project individually. But for this to work it's crucial that the pip<->build-system interface have some sort of versioning or negotiation beyond the bootstrap file's schema version.
So the CLI documented in the PEP isn't *necessarily* going to be the one used by pip to communicate into the build environment - it may be invoked locally within the build environment.
No, it totally will be. Exactly as setup.py is today. Thats deliberate: The *new* thing we're setting out to enable is abstract build systems, not reengineering pip.
The future - sure, someone can write a new thing, and the necessary capability we're building in to allow future changes will allow a new PEP to slot in easily and take on that [non-trivial and substantial chunk of work]. (For instance, how do you do compiler- and build-system-specific options when you have a CLI to talk to pip with?)
I dunno, that seems pretty easy? My original draft just suggested that the build hook would take a dict of string-valued keys, and then we'd add some options to pip like "--project-build-option foo=bar" that would set entries in that dict, and that's pretty much sufficient to get the job done. To enable backcompat you'd also want to map the old --install-option and --build-option switches to add entries to some well-known keys in that dict. But none of the details here need to be specified, because it's up to individual projects/build-systems to assign meaning to this stuff and individual build frontends like pip to provide an interface to it -- at the build-frontend/build-backend interface layer we just need some way to pass through the blobs.

I admit that this is another case where the Python API approach is making things trivial, though ;-). If you want to pass arbitrary user-specified data through a command-line API, while avoiding things like potential namespace collisions between user-defined switches and standard-defined switches, then you have to do much more work than just say "there's another argument that's a dict".

-n

--
Nathaniel J. Smith -- http://vorpus.org
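Collecting repeated key=value switches into such a dict is a one-liner-scale job on the frontend side. A sketch, where "--project-build-option" is the hypothetical pip switch from the discussion and the helper name is illustrative:

```python
def parse_build_options(pairs):
    """Turn repeated 'key=value' strings (as a hypothetical
    '--project-build-option key=value' switch would collect them) into
    the dict of string-valued keys passed through to the build hook."""
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % (pair,))
        options[key] = value
    return options
```

The frontend never interprets these keys; it just passes the dict through the hook boundary for the build system to interpret.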
In case it's useful to make this discussion more concrete, here's a
sketch of what the pip code for dealing with a build system defined by
a Python API might look like:
https://gist.github.com/njsmith/75818a6debbce9d7ff48
Obviously there's room to build on this to get much fancier, but
AFAICT even this minimal version is already enough to correctly handle
all the important stuff -- schema version checking, error reporting,
full args/kwargs/return values. (It does assume that we'll only use
json-serializable data structures for argument and return values, but
that seems like a good plan anyway. Pickle would probably be a bad
idea because we're crossing between two different python environments
that may have different or incompatible packages/classes available.)
-n
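The json-serializable restriction mentioned above amounts to a very small wire format. A sketch of the idea (function names invented here, not taken from the linked gist): a hook call is encoded as one JSON object, and because both sides only exchange JSON-representable values, the two environments need not share any classes, which is exactly the failure mode pickle would introduce:

```python
import json

def encode_call(method, args, kwargs):
    """Serialize a hook call for transport to the worker process."""
    return json.dumps({"method": method, "args": args, "kwargs": kwargs})

def decode_call(payload):
    """Worker-side inverse: recover the method name, args, and kwargs."""
    msg = json.loads(payload)
    return msg["method"], msg["args"], msg["kwargs"]
```

A return value or exception travels back the same way, as a second JSON object.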
On Wed, Nov 11, 2015 at 1:04 AM, Nathaniel Smith
On Tue, Nov 10, 2015 at 11:27 PM, Robert Collins
wrote: On 11 November 2015 at 19:49, Nick Coghlan
wrote: On 11 November 2015 at 16:19, Robert Collins
wrote: ...>> pip is going to be invoking a CLI *no matter what*. Thats a hard requirement unless Python's very fundamental import behaviour changes. Slapping a Python API on things is lipstick on a pig here IMO: we're going to have to downgrade any richer interface; and by specifying the actual LCD as the interface it is then amenable to direct exploration by users without them having to reverse engineer an undocumented thunk within pip.
I'm not opposed to documenting how pip talks to its worker CLI - I just share Nathan's concerns about locking that down in a PEP vs keeping *that* CLI within pip's boundary of responsibilities, and having a documented Python interface used for invoking build systems.
I'm also very wary of something that would be an attractive nuisance. I've seen nothing suggesting that a Python API would be anything but: - it won't be usable [it requires the glue to set up an isolated context, which is buried in pip] in the general case
This is exactly as true of a command line API -- in the general case it also requires the glue to set up an isolated context. People who go ahead and run 'flit' from their global environment instead of in the isolated build environment will experience exactly the same problems as people who go ahead and import 'flit.build_system_api' in their global environment, so I don't see how one is any more of an attractive nuisance than the other?
AFAICT the main difference is that "setting up a specified Python context and then importing something and exploring its API" is literally what I do all day as a Python developer. Either way you have to set stuff up, and then once you do, in the Python API case you get stuff like tab completion, ipython introspection (? and ??), etc. for free.
- no matter what we do, pip can't benefit from it beyond the subprocess interface pip needs, because pip *cannot* import and use the build interface
Not sure what you mean by "benefit" here. At best this is an argument that the two options have similar capabilities, in which case I would argue that we should choose the one that leads to simpler and thus more probably bug-free specification language.
But even this isn't really true -- the difference between them is that either way you have a subprocess API, but with a Python API, the subprocess interface that pip uses has the option of being improved incrementally over time -- including, potentially, to take further advantage of the underlying richness of the Python semantics. Sure, maybe the first release would just take all exceptions and map them into some text printed to stderr and a non-zero return code, and that's all that pip would get. But if someone had an idea for how pip could do better than this by, I dunno, encoding some structured metadata about the particular exception that occurred and passing this back up to pip to do something intelligent with it, they absolutely could write the code and submit a PR to pip, without having to write a new PEP.
tl;dr - I think making the case that the layer we define should be a Python protocol rather than a subprocess protocol requires some really strong evidence. We're *not* dealing with the same moving parts that typical Python stuff requires.
I'm very confused and honestly do not understand what you find attractive about the subprocess protocol approach. Your arguments above aren't really even trying to be arguments that it's good, just arguments that the Python API approach isn't much better. I'm sure there is some reason you like it, and you might even have said it but I missed it because I disagreed or something :-). But literally the only reason I can think of right now for why one would prefer the subprocess approach is that it lets one remove 50 lines of "worker process" code from pip and move them into the individual build backends instead, which I guess is a win if one is focused narrowly on pip itself. But surely there is more I'm missing?
(And even this lines-of-code argument is actually pretty dubious -- right now your draft PEP is importing-by-reference an entire existing codebase (!) for shell variable expansion in command lines, which is code that simply doesn't need to exist in the Python API approach. I'd be willing to bet that your approach requires more code in pip than mine :-).)
However, I've now realised that we're not constrained even if we start with the CLI interface, as there's still a migration path to a Python API based model:
Now: documented CLI for invoking build systems
Future: documented Python API for invoking build systems, default fallback invokes the documented CLI
Or we just issue an updated bootstrap schema, and there's no fallback or anything needed.
Oh no! But this totally gives up the most brilliant part of your original idea! :-)
In my original draft, I had each hook specified separately in the bootstrap file, e.g. (super schematically):
build-requirements = flit-build-requirements
do-wheel-build = flit-do-wheel-build
do-editable-build = flit-do-editable-build
and you counterproposed that instead there should just be one line like
build-system = flit-build-system
and this is exactly right, because it means that if some new capability is added to the spec (e.g. a new hook -- like hypothetically imagine if we ended up deferring the equivalent of egg-info or editable-build-mode to v2), then the new capability just needs to be implemented in pip and in flit, and then all the projects that use flit immediately gain superpowers without anyone having to go around and manually change all the bootstrap files in every project individually.
But for this to work it's crucial that the pip<->build-system interface have some sort of versioning or negotiation beyond the bootstrap file's schema version.
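A sketch of how the single-entry design plus versioning could work in practice: the frontend probes the one named API object for a schema version and for whichever hooks it knows about, so projects gain new capabilities without editing their bootstrap files. The hook names and version attribute here are illustrative, not from any spec:

```python
import types

# Hypothetical build-system API object named by a single
# "build-system = ..." entry in the bootstrap file.
api = types.SimpleNamespace(
    schema_version="1.0",
    get_build_requirements=lambda config: ["wheel"],
    # note: no do_editable_build hook provided yet
)

def supports(api, hook_name):
    """Feature-detect a hook: present and callable means supported."""
    return callable(getattr(api, hook_name, None))

has_reqs = supports(api, "get_build_requirements")
has_editable = supports(api, "do_editable_build")
```

A frontend would fall back to some defined behavior for unsupported hooks; the schema_version check covers incompatible changes that feature detection can't.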
So the CLI documented in the PEP isn't *necessarily* going to be the one used by pip to communicate into the build environment - it may be invoked locally within the build environment.
No, it totally will be. Exactly as setup.py is today. That's deliberate: the *new* thing we're setting out to enable is abstract build systems, not reengineering pip.
The future - sure, someone can write a new thing, and the necessary capability we're building in to allow future changes will allow a new PEP to slot in easily and take on that [non trivial and substantial chunk of work]. (For instance, how do you do compiler and build system specific options when you have a CLI to talk to pip with)?
I dunno, that seems pretty easy? My original draft just suggested that the build hook would take a dict of string-valued keys, and then we'd add some options to pip like "--project-build-option foo=bar" that would set entries in that dict, and that's pretty much sufficient to get the job done. To enable backcompat you'd also want to map the old --install-option and --build-option switches to add entries to some well-known keys in that dict. But none of the details here need to be specified, because it's up to individual projects/build-systems to assign meaning to this stuff and individual build-frontends like pip to provide an interface to it -- at the build-frontend/build-backend interface layer we just need some way to pass through the blobs.
I admit that this is another case where the Python API approach is making things trivial though ;-). If you want to pass arbitrary user-specified data through a command-line API, while avoiding things like potential namespace collisions between user-defined switches and standard-defined switches, then you have to do much more work than just say "there's another argument that's a dict".
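The dict-of-strings pass-through really is that small. A sketch of how a frontend might accumulate repeated --project-build-option KEY=VALUE switches into the dict handed to the build hook (the option name is from the discussion above; the parsing details are invented here):

```python
def collect_build_options(pairs):
    """Turn ['foo=bar', 'cc=clang'] into {'foo': 'bar', 'cc': 'clang'}.
    The frontend never interprets the keys; the build backend does."""
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected KEY=VALUE, got %r" % pair)
        options[key] = value
    return options

opts = collect_build_options(["foo=bar", "cc=clang"])
```

Because the keys live in their own dict rather than in the frontend's flag namespace, user-defined options can never collide with standard-defined switches.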
-n
-- Nathaniel J. Smith -- http://vorpus.org
On November 11, 2015 at 4:05:11 AM, Nathaniel Smith (njs@pobox.com) wrote:
But even this isn't really true -- the difference between them is that either way you have a subprocess API, but with a Python API, the subprocess interface that pip uses has the option of being improved incrementally over time -- including, potentially, to take further advantage of the underlying richness of the Python semantics. Sure, maybe the first release would just take all exceptions and map them into some text printed to stderr and a non-zero return code, and that's all that pip would get. But if someone had an idea for how pip could do better than this by, I dunno, encoding some structured metadata about the particular exception that occurred and passing this back up to pip to do something intelligent with it, they absolutely could write the code and submit a PR to pip, without having to write a new PEP.
I think I prefer a CLI based approach (my suggestion was to remove the formatting/interpolation altogether and just have the file include a list of things to install, and a python module to invoke via ``python -m <thing provided by user>``).

The main reason I think I prefer a CLI based approach is that I worry about the impedance mismatch between the two systems. We’re not actually going to be able to take advantage of Python’s plethora of types in any meaningful capacity, because at the end of the day the bulk of the data is either naturally a string or, as we start to allow end users to pass options through pip into the build system, we have no real way of knowing what the type is supposed to be other than the fact we got it as a CLI flag. How does a user encode something like “pass an integer into this value in the build system” on the CLI in a generic way? I can’t think of any way, which means that any boundary code in the build system is going to need to be smart enough to handle an array of arguments that come in via the user typing something on the CLI. We have a wide variety of libraries to handle that case already for building CLI apps, but we do not have a wide array of libraries handling it for a Python API. It will have to be manually encoded for each and every option that the build system supports.

My other concern is that it introduces another potential area for mistakes that is a bit harder to test. I don’t believe that any sort of “worker.py” script is ever going to be able to handle arbitrary Python values coming back as a return value from a Python script. Whatever serialization we use to send data back into the main pip process (likely JSON) will simply choke and cause an error if it encounters a type it doesn’t know how to serialize. However, this error case will only happen when the build system is being invoked by pip, not when it is being invoked “naturally” in the build system’s unit tests.
By forcing build tool authors to write a CLI interface, we push the work of “how do I serialize my internal data structures” down onto them instead of making it some implicit piece of code that pip needs to work.

The other reason I think a CLI approach is nicer is that it gives us a standard interface that we can use to have defined errors that the build system can emit. For instance, if we wanted to allow the build system to indicate that it can’t do a build because it’s missing a mandatory C library, that would be trivial to do in a natural way with a CLI approach: we just define an error code and say that if the CLI exits with a 2 then we assume it’s missing a mandatory C library, and we can take additional measures in pip to handle that case. If we use a Python API, the natural way to signal an error like that is an exception… but we don’t have any way to force a standard exception hierarchy on people. There is no “Missing C Library Exception” in Python, so either we’d have to encode some numerical or string based identifier that we’ll inspect an exception for (like Exception().error_code), or we’d need to make a mandatory runtime library that build systems must use to get their exceptions from. Alternatively we could have the calling functions return exit codes just like a process boundary does, but that is not natural in Python either; it’s more natural in a language like C.

The main downside to the CLI approach is that it’s harder for the build system to send structured information back to the calling process outside of the defined error codes. However, I do not believe that is particularly difficult, since we can have it do something like send JSON encoded messages on stdout that pip can process and understand. I don’t think that it’s a requirement, or even useful, that the CLI end users use to directly invoke the build system is the same one that pip uses to invoke it.
So we wouldn’t need to worry about the fact that a bunch of JSON blobs being put on stdout isn’t very user friendly, because the user isn’t the target of these commands; pip is.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
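Donald's exit-code-plus-JSON-on-stdout scheme is easy to sketch from the frontend's side. The exit code meaning and message shape below are illustrative stand-ins, not anything specified:

```python
import json
import subprocess
import sys

# A fake build-tool CLI: emits a JSON status message on stdout, then
# exits with 2 to signal (hypothetically) "missing mandatory C library".
worker = (
    "import json, sys\n"
    "print(json.dumps({'event': 'building', 'name': 'demo'}))\n"
    "sys.exit(2)\n"
)

proc = subprocess.run(
    [sys.executable, "-c", worker],
    capture_output=True, text=True,
)

# The frontend parses one JSON message per stdout line and maps the
# well-known exit code to a defined failure category.
messages = [json.loads(line) for line in proc.stdout.splitlines()]
missing_c_library = proc.returncode == 2
```

Because pip is the only consumer of this channel, the unfriendliness of raw JSON on stdout never reaches the user.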
As much as I dislike sniping into threads like this, my gut feeling is strongly pushing towards defining the Python interface in the PEP and keeping command line interfaces as private.
I don't have any new evidence, but pickle and binary stdio (not to mention TCP/HTTP for doing things remotely) are reliable cross-platform where CLIs are not, so you're going to have a horrible time locking down something that will work across multiple OS/shell combinations. There are also limits to command lines lengths that may be triggered when passing many long paths (if that ends up in there).
Might be nice to have an in-proc option for builders too, so I can handle the IPC in my own way. Maybe that's not useful, but with a Python interface it's trivial to enable.
Cheers,
Steve
Top-posted from my Windows Phone
-----Original Message-----
From: "Nathaniel Smith"
On Tue, Nov 10, 2015 at 11:27 PM, Robert Collins
wrote: On 11 November 2015 at 19:49, Nick Coghlan
wrote: On 11 November 2015 at 16:19, Robert Collins
wrote: pip is going to be invoking a CLI *no matter what*. That's a hard requirement unless Python's very fundamental import behaviour changes. Slapping a Python API on things is lipstick on a pig here IMO: we're going to have to downgrade any richer interface; and by specifying the actual LCD as the interface it is then amenable to direct exploration by users without them having to reverse engineer an undocumented thunk within pip.
I'm not opposed to documenting how pip talks to its worker CLI - I just share Nathan's concerns about locking that down in a PEP vs keeping *that* CLI within pip's boundary of responsibilities, and having a documented Python interface used for invoking build systems.
I'm also very wary of something that would be an attractive nuisance. I've seen nothing suggesting that a Python API would be anything but: - it won't be usable [it requires the glue to set up an isolated context, which is buried in pip] in the general case
On November 11, 2015 at 7:51:05 AM, Steve Dower (steve.dower@python.org) wrote:
The flip side is that we are already successfully creating a cross-platform CLI via setup.py. It’s not like this is some new thing; we’ve been handling it for like two decades already.

Pickle makes me nervous because it’s trivial for something to “leak” out of the subprocess into the main process that shouldn’t. For example, if we implement isolated builds then we might end up having a build tool like “mycoolbuildthing” installed not into the same location as pip, but added to PYTHONPATH when invoking the build tool. The build tool then returns some internally defined class as part of its interface, and pickle dutifully serializes that. Then when we go to deserialize it in the main pip process, it blows up and fails because we don’t have “mycoolbuildthing” installed.

I could see an in-language API if Python had a history of typed interfaces where we could write an interface that said “it is an error for this interface to ever return anything but True/False” or some other such rule. However Python doesn’t, and duck typing works against us here, because build tool authors will have to be aware of how we’re serializing the results across the IPC boundary without that IPC actually being defined.
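Donald's pickle "leak" is straightforward to demonstrate. Here "mycoolbuildthing" is a stand-in module created just for this demo, simulating a build tool installed only in the build environment:

```python
import json
import pickle
import sys
import types

# Simulate a class that exists only in the build tool's environment.
mod = types.ModuleType("mycoolbuildthing")

class BuildResult:
    pass

# Pretend the class really lives in mycoolbuildthing:
BuildResult.__module__ = "mycoolbuildthing"
BuildResult.__qualname__ = "BuildResult"
mod.BuildResult = BuildResult
sys.modules["mycoolbuildthing"] = mod

# Pickling succeeds in the build tool's process...
blob = pickle.dumps(BuildResult())

# ...but pip's process has no such module, so unpickling blows up.
del sys.modules["mycoolbuildthing"]
try:
    pickle.loads(blob)
    unpickled = True
except ModuleNotFoundError:
    unpickled = False

# JSON instead fails fast at serialization time, in the tool's own tests:
try:
    json.dumps(BuildResult())
    json_ok = True
except TypeError:
    json_ok = False
```

With pickle the failure only surfaces on pip's side of the boundary; with JSON the build tool's own test suite catches the unserializable type.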
In case it's useful to make this discussion more concrete, here's a sketch of what the pip code for dealing with a build system defined by a Python API might look like:
https://gist.github.com/njsmith/75818a6debbce9d7ff48
Obviously there's room to build on this to get much fancier, but AFAICT even this minimal version is already enough to correctly handle all the important stuff -- schema version checking, error reporting, full args/kwargs/return values. (It does assume that we'll only use json-serializable data structures for argument and return values, but that seems like a good plan anyway. Pickle would probably be a bad idea because we're crossing between two different python environments that may have different or incompatible packages/classes available.)
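In that spirit (this is a fresh sketch, not the gist itself), the whole worker-process protocol fits in a few lines: pip writes one JSON request on the worker's stdin naming a hook, and the worker imports it, calls it, and writes one JSON reply:

```python
import json
import subprocess
import sys

# The worker script pip would run inside the isolated build environment.
WORKER = r"""
import importlib, json, sys
req = json.loads(sys.stdin.read())
mod_name, _, attr = req["hook"].partition(":")
hook = getattr(importlib.import_module(mod_name), attr)
try:
    reply = {"status": "ok", "result": hook(*req["args"])}
except Exception as exc:
    reply = {"status": "error", "message": str(exc)}
print(json.dumps(reply))
"""

# os.path:join is a stand-in hook so the sketch is self-contained;
# a real frontend would name the build backend's API here.
request = {"hook": "os.path:join", "args": ["a", "b"]}
proc = subprocess.run(
    [sys.executable, "-c", WORKER],
    input=json.dumps(request),
    capture_output=True, text=True,
)
reply = json.loads(proc.stdout)
```

All arguments and return values stay JSON-serializable, which sidesteps the pickle cross-environment problem entirely.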
-n
On 10 November 2015 at 22:44, Nathaniel Smith
"Stdin is unspecified, and stdout/stderr can be used for printing status messages, errors, etc. just like you're used to from every other build system in the world."
This is over simplistic. We have real-world requirements from users of pip that they *don't* want to see all of the output that the various build tools produce. That is not something we can ignore. We also have some users saying they want access to all of the build tool output. And we also have a requirement for progress reporting.

Taking all of those requirements into account, pip *has* to have some level of control over the output of a build tool. With setuptools at the moment we have no such control (other than "we may or may not show the output to the user"), and that means we struggle to realistically satisfy all of the conflicting requirements we have.

So we do need much better defined contracts over stdin, stdout and stderr, and return codes. This is true whether the build system is invoked via a Python API or a CLI.

Paul
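The control Paul is describing amounts to the frontend owning the build tool's streams and choosing what to relay. A minimal sketch, with the verbosity behavior invented for illustration:

```python
import subprocess
import sys

def run_build(cmd, verbosity=0):
    """Run a build tool, capturing its output. Show the full log only on
    request (verbosity >= 1) or when the build fails."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if verbosity >= 1 or proc.returncode != 0:
        sys.stdout.write(proc.stdout)
        sys.stderr.write(proc.stderr)
    return proc.returncode, proc.stdout

# Stand-in "build tool" that just prints a line:
code, log = run_build([sys.executable, "-c", "print('compiling...')"])
```

Whatever contract the spec ends up defining for stdout/stderr/return codes, this capture-and-decide layer is where the conflicting user requirements get reconciled.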
On 12 November 2015 at 02:30, Paul Moore
Aye.
I'd like everyone to take a breather on this thread, btw. I'm focusing
on the dependency specification PEP, and until that's at a point where
I can't move it forward, I won't be updating the draft build
abstraction PEP. When that's done, together with the thing Donald and
I hammered out on IRC a few days back (Option 3, earlier), we'll have
something to talk about and consider.
-Rob
--
Robert Collins
On Nov 11, 2015 5:30 AM, "Paul Moore"
On 10 November 2015 at 22:44, Nathaniel Smith
wrote: "Stdin is unspecified, and stdout/stderr can be used for printing status messages, errors, etc. just like you're used to from every other build system in the world."
This is overly simplistic.
We have real-world requirements from users of pip that they *don't* want to see all of the progress output that the various build tools produce. That is not something we can ignore. We also have some users saying they want access to all of the build tool output. And we also have a requirement for progress reporting.
Have you tried current dev versions of pip recently? The default now is to suppress the actual output, but for progress reporting to show a spinner that rotates each time a line of text would have been printed. It's low-tech but IMHO very effective. (And obviously you can also flip a switch to either see all or none of the output as well, or if that isn't there now it can easily be added.) So I kinda feel like these are solved problems.
Taking all of those requirements into account, pip *has* to have some level of control over the output of a build tool - with setuptools at the moment, we have no such control (other than "we may or may not show the output to the user") and that means we struggle to realistically satisfy all of the conflicting requirements we have.
So we do need much better defined contracts over stdin, stdout and stderr, and return codes. This is true whether or not the build system is invoked via a Python API or a CLI.
Even if you really do want to define a generic structured system for build progress reporting (it feels pretty second-systemy to me), then in the Python API approach there are better options than trying to define a specific protocol on stdout.

Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the Python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.

In the subprocess approach you don't really have many options; if you want live feedback from a build process then you have to get it somehow, and you can't just say "fine, part of the protocol is that we use fd 3 for structured status updates", because that doesn't work on Windows. In the Python API approach, we have better options, though. The way I'd do this is to define some sort of progress-reporting abstract interface, like:

    class BuildUpdater:
        # pass -1 for "unknown"
        def set_total_steps(self, n):
            pass

        # if total is unknown, call this repeatedly to say
        # "something's happening"
        def set_current_step(self, n):
            pass

        def alert_user(self, message):
            pass

And methods like build_wheel would accept an object implementing this interface as an argument. Stdout/stderr keep the same semantics as they have today; this is a separate, additional channel.
And then a build frontend could decide how it wants to actually implement this interface. A simple frontend that didn't want to implement fancy UI stuff might just have each of those methods print something to stderr to be captured along with the rest of the chatter. A fancier frontend like pip could pick whichever ipc mechanism they like best and implement that inside their worker. (E.g., maybe on POSIX we use fd 3, and on windows we do incremental writes to a temp file, or use a named pipe. Or maybe we prefer to stick to using stdout for pip<->worker communication, and the worker would take the responsibility of robustly redirecting stdout via dup2 before invoking the actual build hook. There are lots of options; the beauty of the approach, again, is that we don't have to pick one now and write it in stone.) -n
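As a concrete illustration of the "simple frontend" case described above — one that just forwards every update to stderr along with the rest of the chatter — an implementation of the sketched BuildUpdater interface might look like this (the interface itself is only a proposal in this thread, not a settled API):

```python
import sys

class StderrBuildUpdater:
    """Trivial frontend-side implementation of the (hypothetical)
    BuildUpdater interface: every progress update is simply printed
    to stderr, mixed in with the rest of the build output."""

    def __init__(self):
        self.total = -1  # -1 means "unknown", per the sketch

    def set_total_steps(self, n):
        self.total = n

    def set_current_step(self, n):
        if self.total == -1:
            sys.stderr.write("build: working...\n")
        else:
            sys.stderr.write("build: step %d of %d\n" % (n, self.total))

    def alert_user(self, message):
        sys.stderr.write("build: %s\n" % message)
```

A fancier frontend would implement the same three methods against whatever IPC channel it prefers; the backend calling the interface never needs to know the difference.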
On November 11, 2015 at 1:38:38 PM, Nathaniel Smith (njs@pobox.com) wrote:
Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.
How is this not true for a worker.py process as well? If the worker process communicates via stdout then it has to make sure it captures the stdout and redirects it before calling into the Python API, and then undoes that afterwards. It actually makes it harder to do incremental output, because a Python function can't return in the middle of execution, so we'd need to make it some sort of awkward generator protocol to make that happen too.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On Nov 11, 2015 12:31 PM, "Robert Collins"
On 12 November 2015 at 02:30, Paul Moore
wrote: On 10 November 2015 at 22:44, Nathaniel Smith
wrote: "Stdin is unspecified, and stdout/stderr can be used for printing status messages, errors, etc. just like you're used to from every other build system in the world."
This is over simplistic.
We have real-world requirements from users of pip that they *don't* want to see all of the progress that the various build tools invoke. That is not something we can ignore. We also have some users saying they want access to all of the build tool output. And we also have a requirement for progress reporting.
Taking all of those requirements into account, pip *has* to have some level of control over the output of a build tool - with setuptools at the moment, we have no such control (other than "we may or may not show the output to the user") and that means we struggle to realistically satisfy all of the conflicting requirements we have.
So we do need much better defined contracts over stdin, stdout and stderr, and return codes. This is true whether or not the build system is invoked via a Python API or a CLI.
Aye.
I'd like everyone to take a breather on this thread, btw. I'm focusing on the dependency specification PEP, and until that's at a point where I can't move it forward, I won't be updating the draft build abstraction PEP:
Presumably, it would be great to list a platform parameter description as JSONLD-serializable keys and values (e.g. for a bdist/wheel build "imprint" in the JSONLD build metadata composition file) ... #PEP426JSONLD
When that's done, together with the thing Donald and I hammered out on IRC a few days back (Option 3, earlier), we'll have something to talk about and consider.
-Rob
-- Robert Collins
Distinguished Technologist
HP Converged Cloud
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
On Wed, Nov 11, 2015 at 10:42 AM, Donald Stufft
On November 11, 2015 at 1:38:38 PM, Nathaniel Smith (njs@pobox.com) wrote:
Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.
How is this not true for a worker.py process as well? If the worker process communicates via stdout then it has to make sure it captures the stdout and redirects it before calling into the Python API, and then undoes that afterwards. It actually makes it harder to do incremental output, because a Python function can't return in the middle of execution, so we'd need to make it some sort of awkward generator protocol to make that happen too.
Did you, uh, read the second half of my email? :-) My actual position is that we shouldn't even try to get structured incremental output from the build system, and should stick with the current approach of unstructured incremental output on stdout/stderr. But if we do insist on getting structured incremental output, then I described a system that's much easier for backends to implement, while leaving it up to the frontend to pick whether they want to bother doing complicated redirection tricks, and if so then which particular variety of complicated redirection trick they like best.

In both approaches, yeah, any kind of incremental output eventually comes down to some Python code issuing some sort of function call that reports progress without returning, whether that's sys.stdout.write(json.dumps(...)) or progress_reporter.report_update(...). Between those two options, it's sys.stdout.write(json.dumps(...)) that looks more awkward to me.

-n

--
Nathaniel J. Smith -- http://vorpus.org
On 12 November 2015 at 08:07, Nathaniel Smith
On Wed, Nov 11, 2015 at 10:42 AM, Donald Stufft
wrote: On November 11, 2015 at 1:38:38 PM, Nathaniel Smith (njs@pobox.com) wrote:
Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.
How is this not true for a worker.py process as well? If the worker process communicates via stdout then it has to make sure it captures the stdout and redirects it before calling into the Python API, and then undoes that afterwards. It actually makes it harder to do incremental output, because a Python function can't return in the middle of execution, so we'd need to make it some sort of awkward generator protocol to make that happen too.
Did you, uh, read the second half of my email? :-) My actual position is that we shouldn't even try to get structured incremental output from the build system, and should stick with the current approach of unstructured incremental output on stdout/stderr. But if we do insist on getting structured incremental output, then I described a system that's much easier for backends to implement, while leaving it up to the frontend to pick whether they want to bother doing complicated redirection tricks, and if so then which particular variety of complicated redirection trick they like best.
In both approaches, yeah, any kind of incremental output eventually comes down to some Python code issuing some sort of function call that reports progress without returning, whether that's sys.stdout.write(json.dumps(...)) or progress_reporter.report_update(...). Between those two options, it's sys.stdout.write(json.dumps(...)) that looks more awkward to me.
I think there is some big disconnect in the conversation. AIUI Donald
and Marcus and I are saying that build systems should just use
print("Something happened")
to provide incremental output.
-Rob
--
Robert Collins
On November 11, 2015 at 2:08:00 PM, Nathaniel Smith (njs@pobox.com) wrote:
On Wed, Nov 11, 2015 at 10:42 AM, Donald Stufft wrote:
On November 11, 2015 at 1:38:38 PM, Nathaniel Smith (njs@pobox.com) wrote:
Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.
How is this not true for a worker.py process as well? If the worker process communicates via stdout then it has to make sure it captures the stdout and redirects it before calling into the Python API, and then undoes that afterwards. It actually makes it harder to do incremental output, because a Python function can't return in the middle of execution, so we'd need to make it some sort of awkward generator protocol to make that happen too.
Did you, uh, read the second half of my email? :-) My actual position is that we shouldn't even try to get structured incremental output from the build system, and should stick with the current approach of unstructured incremental output on stdout/stderr. But if we do insist on getting structured incremental output, then I described a system that's much easier for backends to implement, while leaving it up to the frontend to pick whether they want to bother doing complicated redirection tricks, and if so then which particular variety of complicated redirection trick they like best.
In both approaches, yeah, any kind of incremental output eventually comes down to some Python code issuing some sort of function call that reports progress without returning, whether that's sys.stdout.write(json.dumps(...)) or progress_reporter.report_update(...). Between those two options, it's sys.stdout.write(json.dumps(...)) that looks more awkward to me.
I'm confused how the progress indicator you just implemented would work if there wasn't something triggering a "hey, I'm still doing work" signal to incrementally output information.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On Wed, Nov 11, 2015 at 4:29 AM, Donald Stufft
On November 11, 2015 at 4:05:11 AM, Nathaniel Smith (njs@pobox.com) wrote:
But even this isn't really true -- the difference between them is that either way you have a subprocess API, but with a Python API, the subprocess interface that pip uses has the option of being improved incrementally over time -- including, potentially, to take further advantage of the underlying richness of the Python semantics. Sure, maybe the first release would just take all exceptions and map them into some text printed to stderr and a non-zero return code, and that's all that pip would get. But if someone had an idea for how pip could do better than this by, I dunno, encoding some structured metadata about the particular exception that occurred and passing this back up to pip to do something intelligent with it, they absolutely could write the code and submit a PR to pip, without having to write a new PEP.
I think I prefer a CLI based approach (my suggestion was to remove the formatting/interpolation all together and just have the file include a list of things to install, and a python module to invoke via ``python -m <thing provided by user>``).
The main reason I think I prefer a CLI based approach is that I worry about the impedance mismatch between the two systems. We’re not actually going to be able to take advantage of Python’s plethora of types in any meaningful capacity because at the end of the day the bulk of the data is either naturally a string or as we start to allow end users to pass options through pip into the build system, we have no real way of knowing what the type is supposed to be other than the fact we got it as a CLI flag. How does a user encode something like “pass an integer into this value in the build system?” on the CLI in a generic way? I can’t think of any way which means that any boundary code in the build system is going to need to be smart enough to handle an array of arguments that come in via the user typing something on the CLI. We have a wide variety of libraries to handle that case already for building CLI apps but we do not have a wide array of libraries handling it for a Python API. It will have to be manually encoded for each and every option that the build system supports.
You're overcomplicating things :-). The solution to this problem is just "pip's UI only allows passing arbitrary strings as option values, so build backends had better deal with it". That's what we'd effectively be doing anyway in the CLI approach.
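Under that rule, the type coercion lives entirely in the backend's boundary code. A minimal sketch of what that might look like — the option names ("jobs", "debug") are made up for illustration, not part of any spec:

```python
def parse_build_options(raw_options):
    """Sketch of a build backend's option-handling boundary.

    The frontend (e.g. pip) only ever passes string values, so the
    backend coerces the handful of options it knows to be non-string
    itself.  Option names here are hypothetical."""
    parsed = {}
    for key, value in raw_options.items():
        if key == "jobs":      # hypothetical integer-valued option
            parsed[key] = int(value)
        elif key == "debug":   # hypothetical boolean-valued option
            parsed[key] = value.lower() in ("1", "true", "yes")
        else:                  # everything else stays a string
            parsed[key] = value
    return parsed
```

This is exactly the same parsing work a CLI-based backend would do on its argv, just applied to a dict of strings instead.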
My other concern is that it introduces another potential area for mistakes that is a bit harder to test. I don't believe that any sort of "worker.py" script is ever going to be able to handle arbitrary Python values coming back as a return value from a Python script. Whatever serialization we use to send data back into the main pip process (likely JSON) will simply choke and cause an error if it encounters a type it doesn't know how to serialize. However, this error case will only happen when the build system is being invoked by pip, not when it is being invoked "naturally" in the build system's unit tests. By forcing build tool authors to write a CLI interface, we push the work of "how do I serialize my internal data structures" down onto them instead of making it some implicit piece of code that pip needs to work.
I think this is another issue that isn't actually a problem. Remember, we don't need to support translating arbitrary Python function calls across process boundaries; there will be a fixed, finite set of methods that we need to support, and those methods' semantics will be defined in a PEP. So e.g., if the PEP says that build backends should define a method like this:

    def build_requirements(self, build_options):
        """Calculate the dynamic portion of the build-requirements.

        :param build_options: The build options dictionary.
        :returns: A list of strings, where each string is a PEP XX
            requirement specifier.
        """

then our IPC mechanism doesn't need to be able to handle arbitrary types as return values; it needs to be able to handle a list of strings. Which that sketch I sent does handle, so we're good.

And the build tool's unit tests will be checking that it returns a list of strings, because... that's what unit tests do, they validate that methods implement the interface that they're defined to implement :-). So this is a non-problem -- we just have to make sure when we define the various method interfaces in the PEP that we don't have any methods that return arbitrary complicated Python types. Which we weren't going to be tempted to do anyway.

-n

--
Nathaniel J. Smith -- http://vorpus.org
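The "fixed, finite set of methods" point can be made concrete: the worker's half of the IPC only ever has to serialize the return types the spec pins down. A sketch, assuming a JSON framing for the parent<->worker channel (the framing and the `call_hook` helper are illustrative assumptions, not anything specified):

```python
import json

def call_hook(backend, method_name, kwargs):
    """Worker-side sketch: invoke one of the fixed, PEP-defined hook
    methods on the backend object and serialize the result for the
    parent process.

    Because each hook's return type is pinned down by the spec (e.g.
    a requirements hook returns a list of strings), json.dumps can
    never hit an unserializable type in a conforming backend."""
    method = getattr(backend, method_name)
    result = method(**kwargs)
    return json.dumps({"method": method_name, "result": result})
```

A non-conforming backend that returned, say, a custom object would fail here with a clear TypeError from json — the same failure its own unit tests against the spec'd interface would catch.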
On Wed, Nov 11, 2015 at 11:12 AM, Robert Collins
On 12 November 2015 at 08:07, Nathaniel Smith
wrote: On Wed, Nov 11, 2015 at 10:42 AM, Donald Stufft
wrote: On November 11, 2015 at 1:38:38 PM, Nathaniel Smith (njs@pobox.com) wrote:
Guaranteeing a clean stdout/stderr is hard: it means you have to be careful to correctly capture and process the output of every child you invoke (e.g. compilers), and deal correctly with the tricky aspects of pipes (deadlocks, sigpipe, ...). And even then you can get thwarted by accidentally importing the wrong library into your main process, and discovering that it writes directly to stdout/stderr on some error condition. And it may or may not respect your resetting of sys.stdout/sys.stderr at the python level. So to be really reliable the only thing to do is to create some pipes and some threads to read the pipes and do the dup2 dance (but not everyone will actually do this, they'll just accept corrupted output on errors) and ugh, all of this is a huge hassle that massively raises the bar on implementing simple build systems.
How is this not true for a worker.py process as well? If the worker process communicates via stdout then it has to make sure it captures the stdout and redirects it before calling into the Python API, and then undoes that afterwards. It actually makes it harder to do incremental output, because a Python function can't return in the middle of execution, so we'd need to make it some sort of awkward generator protocol to make that happen too.
Did you, uh, read the second half of my email? :-) My actual position is that we shouldn't even try to get structured incremental output from the build system, and should stick with the current approach of unstructured incremental output on stdout/stderr. But if we do insist on getting structured incremental output, then I described a system that's much easier for backends to implement, while leaving it up to the frontend to pick whether they want to bother doing complicated redirection tricks, and if so then which particular variety of complicated redirection trick they like best.
In both approaches, yeah, any kind of incremental output eventually comes down to some Python code issuing some sort of function call that reports progress without returning, whether that's sys.stdout.write(json.dumps(...)) or progress_reporter.report_update(...). Between those two options, it's sys.stdout.write(json.dumps(...)) that looks more awkward to me.
I think there is some big disconnect in the conversation. AIUI Donald and Marcus and I are saying that build systems should just use
print("Something happened")
to provide incremental output.
I agree that this is the best approach. This particular subthread is all hanging off of Paul's message [1] where he argues that we can't just print arbitrary text to stdout/stderr, we need, like, structured JSON messages on stdout that pip can parse while the build is running. (Which implies that you can *only* have structured JSON messages on stdout, because otherwise there's no way to tell which bits are supposed to be structured and which bits are just arbitrary text.) And I said well, I think that's probably overcomplicated and unnecessary, but if you insist then this is what it would look like in the different approaches.

(Your current draft does create similar challenges for build backends because it also uses stdout for passing structured data. But I know you're in the middle of rewriting it anyway, so maybe this is irrelevant.)

-n

[1] http://thread.gmane.org/gmane.comp.python.distutils.devel/24760/focus=24792

--
Nathaniel J. Smith -- http://vorpus.org
On 11 November 2015 at 18:38, Nathaniel Smith
Have you tried current dev versions of pip recently?
No, but I did see your work on this, and I appreciate and approve of it.
The default now is to suppress the actual output, but for progress reporting to show a spinner that rotates each time a line of text would have been printed. It's low-tech but IMHO very effective. (And obviously you can also flip a switch to either see all or none of the output as well, or if that isn't there now it can easily be added.) So I kinda feel like these are solved problems.
And this relies on build tools outputting to stdout, not stderr, and not buffering their output. That's an interface spec. Not everything has to be massively complicated, and I wasn't implying it needed to be. Just that we need conventions. One constant annoyance for pip is that distutils doesn't properly separate stdout and stderr, so we can't suppress unnecessary status reports without losing important error messages. Users report this as a bug in pip, not in distutils, and I don't imagine that would change if a project was using <name your build tool here>.
Taking all of those requirements into account, pip *has* to have some level of control over the output of a build tool - with setuptools at the moment, we have no such control (other than "we may or may not show the output to the user") and that means we struggle to realistically satisfy all of the conflicting requirements we have.
So we do need much better defined contracts over stdin, stdout and stderr, and return codes. This is true whether or not the build system is invoked via a Python API or a CLI.
Even if you really do want to define a generic structured system for build progress reporting (it feels pretty second-systemy to me), then in the python api approach there are better options than trying to define a specific protocol on stdout.
No, no, no. I never said that. All I was saying was that we need a level of agreement on what pip can expect to do with stdout and stderr, *given that there are known requirements pip's users expect to be satisfied*. Paul
On 11 November 2015 at 19:31, Nathaniel Smith
This particular subthread is all hanging off of Paul's message [1] where he argues that we can't just print arbitrary text to stdout/stderr, we need, like, structured JSON messages on stdout that pip can parse while the build is running
As I already pointed out, I never said that. Paul
On Wed, Nov 11, 2015 at 11:34 AM, Paul Moore
On 11 November 2015 at 18:38, Nathaniel Smith
wrote: Have you tried current dev versions of pip recently?
No, but I did see your work on this, and I appreciate and approve of it.
The default now is to suppress the actual output, but for progress reporting to show a spinner that rotates each time a line of text would have been printed. It's low-tech but IMHO very effective. (And obviously you can also flip a switch to either see all or none of the output as well, or if that isn't there now it can easily be added.) So I kinda feel like these are solved problems.
And this relies on build tools outputting to stdout, not stderr, and not buffering their output.
FWIW the spinner patch actually looks at both stdout and stderr, and it also takes care to force the child process's sys.stdout/sys.stderr into line-buffered mode, but of course this buffering tweak only helps for output printed by python code running in the immediate child. So yeah, it wouldn't hurt to add a few non-normative words about buffering to my original one-sentence specification :-).
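The spinner behaviour described here — advance one spinner character per line of child output while capturing the text — can be sketched roughly as follows. This is an illustration of the idea only, not pip's actual patch (which, as noted, also watches stderr separately and tweaks the child's buffering):

```python
import subprocess
import sys

SPINNER = "|/-\\"

def run_with_spinner(cmd):
    """Run cmd, suppressing its output but rotating a spinner on our
    own stderr once per line the child prints.  Combines the child's
    stdout and stderr into one captured stream for simplicity."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    captured = []
    for i, line in enumerate(proc.stdout):
        captured.append(line)
        # \r returns to column 0 so the spinner rotates in place
        sys.stderr.write("\r" + SPINNER[i % len(SPINNER)])
        sys.stderr.flush()
    proc.wait()
    sys.stderr.write("\r")
    return proc.returncode, b"".join(captured)
```

Note the limitation Nathaniel describes: a child that block-buffers its output (e.g. non-Python code writing to a pipe) will make the spinner advance in bursts rather than smoothly.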
That's an interface spec. Not everything has to be massively complicated, and I wasn't implying it needed to be. Just that we need conventions. One constant annoyance for pip is that distutils doesn't properly separate stdout and stderr, so we can't suppress unnecessary status reports without losing important error messages. Users report this as a bug in pip, not in distutils, and I don't imagine that would change if a project was using <name your build tool here>.
Sorry for misunderstanding!

I guess the other thing we could do is to try to convince build systems to do a better job of separating stdout and stderr, but I'm dubious about how much this would help, because I think the problem is more fundamental than that. For outright errors, there isn't really a problem IMO, because when the build fails that gives you a clear signal that you should probably show the user the output :-).

The case that's trickier, and could potentially benefit, is warnings that don't cause the build to fail. If gcc outputs a warning, should we show that to the user? Yes if this is the developer building their own code... but probably not if this is pip building from an automatically downloaded sdist for an end-user -- there are lots and lots of harmless warnings in the output of popular packages, and dumping those scary and inscrutable messages on end-users is going to create all the problems we were trying to avoid by hiding the output in the first place.

-n

--
Nathaniel J. Smith -- http://vorpus.org
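The "clear signal" policy for outright errors reduces to something very simple on the frontend side: capture everything, stay quiet on success, and dump the full output only when the build fails. A sketch (illustrative, not pip's code):

```python
import subprocess
import sys

def quiet_build(cmd):
    """Run a build command silently; replay its combined output on
    stderr only if it exits non-zero, since a failed build is the
    clear signal that the user needs to see what happened."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)
    if proc.returncode != 0:
        sys.stderr.write(proc.stdout.decode(errors="replace"))
    return proc.returncode
```

The harder warnings-on-success case discussed above has no equivalently clean rule, which is exactly the point being made.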
participants (8)
- Donald Stufft
- Marcus Smith
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Robert Collins
- Steve Dower
- Wes Turner