On Jun 1, 2017, at 6:28 PM, Paul Moore <p.f.moore@gmail.com> wrote:

On 1 June 2017 at 23:14, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Thu, Jun 1, 2017, at 10:49 PM, Paul Moore wrote:
pip also needs a way to deal with "pip install <local directory>". In
this case, pip (under its current model) copies that directory to a
working area. In that area, it runs the build command to create a
wheel, and proceeds from there. In principle, there's little change in
a PEP 517 world. But again, see below.

I still question whether the copying step is necessary for the frontend.
Pip does it for setup.py builds (AIUI) because they might modify or
create files in the working directory, and it wants to keep the source
directory clean of that. Flit can create a wheel without
modifying/creating any files in the working directory.

That's a very fair comment, and I honestly don't know how critical the
copy step is - in the sense that I know we do it to prevent certain
classes of issue, but I don't know what they are, or how serious they
are. Perhaps Donald does?

It's certainly true that setup.py based builds are particularly
unpleasant for the obvious "running arbitrary code" reasons. But I'm
not sure how happy I am simply saying "backends must ..." what? How
would we word this precisely? It's not just about keeping the sources
clean, it's also about not being affected by unexpected files in the
source directory. Consider that a build using a compiler will have
object files somewhere. Should a backend use existing object files in
preference to sources? What about a backend based on a tool designed
to do precisely that, like waf or make? What if the files came from a
build with different compiler flags? Sure, it's user error or a
backend bug, but it'll be reported to pip as "I tried to install foo
and my program failed when I imported it". We get that sort of bug
report routinely (users reporting bugs in build scripts as pip
problems) and we'll never have a technical solution to all the ways
they can occur, but preventative code like copying the build files to
a clean location can minimise them. (As I say, I'm speculating about
whether that's actually why we build in a temp location, but it's
certainly the sort of thinking that goes into our design).


I suspect the original reasoning behind copying to a temporary location has been lost to the sands of time. We’ve been doing that in pip for as long as I’ve worked on it; maybe Jannis or someone else remembers why, I dunno!

From my end, copying the entire directory alleviates a few problems:

* In the current environment, it prevents random debris from cluttering up and being written to the current directory, including build files.
  * This is important, because not only is it unhygienic to allow random bits of crap to crap all over the local directory, but in the current system the build directories are not sufficiently platform dependent (e.g. a Linux build only gets identified as a Linux build, even if it links against two different ABIs because it was mounted inside of a Debian and a CentOS Docker container).

* It reduces errors caused by people/tooling editing files while a build is being processed. This can’t ever be fully removed, but by copying to a temporary location we considerably narrow the window in which someone can inadvertently muck up their build mid-progress.

* It prevents some issues with two builds running at the same time.
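To make the copy step concrete, here is a minimal sketch of "copy the directory to a working area, then build there". It is not pip's actual code: `build_in_clean_copy` and the `build` callable are hypothetical names used only for illustration, and the only assumption is that the build step can be pointed at an arbitrary working directory and output directory.

```python
import shutil
import tempfile
from pathlib import Path


def build_in_clean_copy(source_dir, output_dir, build):
    """Copy source_dir to a fresh temporary directory and build from the copy.

    `build` is a hypothetical callable: build(work_dir, output_dir) -> Path.
    Anything it writes into work_dir is thrown away with the temp directory,
    so the original source_dir is never modified by the build.
    """
    source_dir = Path(source_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory(prefix="build-") as tmp:
        # Each call gets its own working copy, so two builds running at the
        # same time do not share intermediate files.
        work_dir = Path(tmp) / "src"
        shutil.copytree(source_dir, work_dir)
        return build(work_dir, output_dir)
```

Edits made to the source directory after the copy finishes cannot affect the build, which is the "narrow the window" point above. It does nothing about stale files that were already in the source directory when the copy was taken.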

Narrowing that down to producing an sdist (or some other “copy what you would need” hook) additionally prevents:

* Unexpected files changing the behavior of the build.

* Misconfigured build tools appearing to “work” in development but failing when the sdist is released to PyPI, or the sdist and the wheels differing because the wheel was built from a VCS checkout rather than from the sdist.

Ultimately you’re right, we could just encode this into PEP 517 and say that projects need to *either* give us a way to copy the files they need OR they need hygienic builds that do not modify the current directory at all. I greatly prefer *not* to do that though, because everyone is only human, and there are likely to be build backends that don’t do that (either purposely or accidentally), and it’ll likely be pip that fields those support issues (because as the user sees it, they invoked pip, so it must be pip’s fault).

In my mind the cost of *requiring* some mechanism of doing this is pretty low: the project obviously needs to know what files are important to it, or else how is it going to know what to build in the first place? For most projects the amount of data that *needs* to be copied (versus stuff that is just sitting there taking up space) is pretty small, so even on a really slow HDD the copy operation should not take a significant amount of time. It’s also not a particularly hard thing to implement, I think; certainly it’s much easier than actually building a project in the first place.
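As a rough illustration of how small such a “copy what you would need” step could be, here is a sketch of what a backend might do. The function name `copy_needed_files` and its signature are made up for this example and are not part of PEP 517 or any existing backend; it assumes the backend can describe its inputs as glob patterns relative to the project root.

```python
import shutil
from pathlib import Path


def copy_needed_files(source_dir, target_dir, patterns):
    """Hypothetical hook: copy only the files the build actually needs.

    `patterns` are glob patterns relative to source_dir (an assumption of
    this sketch). Returns the copied paths, relative to source_dir, as
    POSIX-style strings.
    """
    source_dir = Path(source_dir)
    target_dir = Path(target_dir)

    # Collect matches first so overlapping patterns copy each file once.
    needed = set()
    for pattern in patterns:
        for path in source_dir.glob(pattern):
            if path.is_file():
                needed.add(path.relative_to(source_dir))

    for rel in sorted(needed):
        dest = target_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_dir / rel, dest)

    return [rel.as_posix() for rel in sorted(needed)]
```

Anything not matched (stale object files, editor backups, a previous build directory) simply never reaches the working area, so it cannot change the behavior of the build.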

There’s a principle here at Amazon that goes, “Good intentions don’t matter”. It essentially means that simply saying you’re going to do something good doesn’t count, because you’re inevitably going to forget or mess up; instead of just having the intention to do something, you should have a process in place that ensures it is going to happen. Saying that we’re going to make the copying optional and hope that the build tools correctly build in place without an issue feels like a “good intention” to me, whereas adding the API and step that *mandates* (through technical means) they do it correctly is putting a process in place that ensures it is going to happen.

Donald Stufft