On May 29, 2017, at 4:04 PM, Nathaniel Smith <njs@pobox.com> wrote:

Ugh, sorry, fat-fingered that. Actual reply below...

On Mon, May 29, 2017 at 12:56 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, May 29, 2017 at 12:50 PM, Donald Stufft <donald@stufft.io> wrote:

To be honest, I’m hardly going to feel particularly bad if one of the
most compilation-heavy packages in existence takes a whole 10 seconds to
install from a VCS checkout.

Rebuild latency is *really* important. People get really cranky at me
when I argue that we should get rid of "editable installs", which
create much greater problems for maintaining consistent environments,
and that's only saving like 1 second of latency. I think I'm entitled
to be cranky if your response is "well suck it up and maybe rewrite
all your build tools”.

Well, distutils literally already has support for storing the “cache” someplace other than the current directory; the current directory is just the default. So “rewrite all your build tools” is fairly hyperbolic; it’s really just “change the default in your build tools”. See for example https://gist.github.com/dstufft/a577c3c9d54a3bb3b88e9b20ba86c625, which shows that NumPy etc. are already capable of this.
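
For instance (the cache path here is purely illustrative), pointing the build directory somewhere else is already a one-liner with plain distutils/setuptools:

    python setup.py build --build-base="$HOME/.cache/my-build-cache/numpy"

or, equivalently, in setup.cfg:

    [build]
    build_base = /home/user/.cache/my-build-cache/numpy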

Hell, the build backend could create an unpacked sdist in the target directory instead of an actual sdist already packed into a tarball, and tools like twine could add a ``twine sdist`` command that calls the “create unpacked sdist” API and then tars the directory up into the sdist. A quick, rudimentary test on my machine (using ``python setup.py sdist --formats=`` in a numpy checkout [1]) suggests that this entire process takes ~0.7s, while the copy operation on that same checkout (shutil.copytree) also takes ~0.7s. That also eliminates the need to untar, so unless someone is doing something in their sdist creation step that takes a significant amount of time, generating an unpacked sdist is really not any more time consuming than copying the files.
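
As a rough sketch (the function name and layout are made up), the “pack up an unpacked sdist” half that something like ``twine sdist`` would need is basically just:

    import os
    import tarfile

    def pack_sdist(unpacked_dir, dist_dir="dist"):
        # unpacked_dir is expected to look like "numpy-1.13.0/"
        base = os.path.basename(os.path.normpath(unpacked_dir))
        os.makedirs(dist_dir, exist_ok=True)
        out_path = os.path.join(dist_dir, base + ".tar.gz")
        with tarfile.open(out_path, "w:gz") as tar:
            # arcname keeps the {name}-{version}/ prefix inside the tarball
            tar.add(unpacked_dir, arcname=base)
        return out_path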


NumPy really isn't that compilation heavy either... it's all C, which
is pretty quick. SciPy is *much* slower, for example, as is pretty
much any project using C++.

Particularly when I assume that the build tool
can be even smarter here than ccache can be, reducing the setup.py
build step back down to the no-op incremental-build case.

I mean, unless numpy is doing something different, the default distutils
incremental build machinery is incredibly dumb: it just stores the build
output in a directory (by default ./build/) and compares the mtime of a
list of source files with the mtime of the target file, and if the
source files are newer, it recompiles. If you replace mtime with blake2
(or similar) then you can trivially support the exact same thing, just
storing the built target files in some user directory cache instead.
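
Roughly, and with the cache layout entirely invented for illustration, that check could look like:

    import hashlib
    import os

    CACHE = os.path.expanduser("~/.cache/my-cool-build-tool")

    def blake2_digest(path):
        h = hashlib.blake2b()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def needs_rebuild(source, target):
        # Rebuild only if we have never built this exact source content
        # before (i.e. no cached object exists for its hash).
        key = blake2_digest(source)
        cached = os.path.join(CACHE, key, os.path.basename(target))
        return not os.path.exists(cached)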

Cache management is not a trivial problem.

And it actually doesn't matter, because we definitely can't silently
dump stuff into some user directory. An important feature of storing
temporary artifacts in the source tree is that it means that if
someone downloads the source, plays around with it a bit, and deletes
it, then it's actually gone. We can't squirrel away a few hundred
megabytes of data in some hidden directory that will hang around for
years after the user stops using numpy.


I mean, you absolutely can do that. We store temporary wheels and HTTP responses silently in pip and have for years. I don’t think *anyone* has *ever* complained about it. I think macOS will even explicitly clean up stuff from ~/Library/Caches when it hasn’t been used in a while, and if you use the standard cache locations on Linux then IIRC similar mechanisms exist there too. In exchange for “I can delete the directory and it’s just all gone”, you get “faster builds in more scenarios, including straight from PyPI’s sdists”. If I were a user I’d care a lot more about the second than the first.

But even if I grant you that you can’t just do that silently, then go ahead and make it opt-in. For people who need it, a simple boolean in a config file seems like a pretty low cost to me.
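
To make “the standard cache locations” concrete, this is roughly the kind of per-platform lookup that libraries like appdirs do (the “my-cool-build-tool” name is obviously made up):

    import os
    import sys

    def user_cache_dir(appname="my-cool-build-tool"):
        if sys.platform == "darwin":
            # per the note above, macOS may clean up unused entries here
            return os.path.join(os.path.expanduser("~/Library/Caches"), appname)
        elif os.name == "nt":
            base = os.environ.get("LOCALAPPDATA", os.path.expanduser("~"))
            return os.path.join(base, appname, "Cache")
        else:
            # XDG spec: honor $XDG_CACHE_HOME, defaulting to ~/.cache
            base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
            return os.path.join(base, appname)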


Combine this user cache with generating an unpacked sdist instead of copying the directory tree, and you get:

1) Safety from the weirdness that comes from ``pip install`` of an sdist versus a VCS checkout.
2) Not crapping up ./ with random debris from the installation process.
3) Fast incremental builds that even help speed up installs from PyPI etc. (assuming we use something like blake2 to compute hashes for the files).

And you lose:

1) Deleting a clone doesn’t delete the cache directory, but your OS might already be managing that directory anyway.

Seems like an obvious trade-off to me.



Hell,
we *might* even be able to preserve mtime (if we’re not already… we might
be! But I’d need to dig into it) so literally the only thing that would need
to change is that instead of storing the built artifacts in ./build/ you
store them in ~/.cache/my-cool-build-tool/{project-name}. Bonus points: this
means you get incremental speeds even when building from an sdist from PyPI
that doesn’t have wheels and hasn’t changed those files either.
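
(For what it’s worth, a quick sanity check suggests that tarfile itself does round-trip mtimes; whether every tool in the chain preserves them is a separate question. The file name below is made up:)

    import os
    import tarfile
    import tempfile

    src = "somefile.c"  # any existing regular file
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "t.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname="somefile.c")
        with tarfile.open(archive) as tar:
            tar.extractall(tmp)
        original = int(os.path.getmtime(src))
        restored = int(os.path.getmtime(os.path.join(tmp, "somefile.c")))
        assert original == restored  # tar stores mtimes at 1s resolution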

I’m of the opinion that first you need to make it *correct*, then you can
try to make it *fast*. It is my opinion that an installer that shits random
debris into your current directory is not correct. It’s kind of silly that
we have to add a chunk of “random pip/distutils/setuptools crap” to the
.gitignore of basically every Python package in existence. Never mind
the random stuff that doesn’t currently get written there, but will if we
stop copying files out of the path and into a temporary location (I’m sure
everyone wants a pip-egg-info directory in their current directory).

I’m also of the opinion that avoiding footguns is more important than
shooting for the fastest operation possible. I regularly (sometimes multiple
times a week, but often every week or two) see people tripping up on the
fact that ``git clone … && pip install .`` does something different than
``git clone … && python setup.py sdist && pip install dist/*``. Files
suddenly go missing and they have no idea why. If they’re lucky, they’ll
figure out they need to modify some combination of package_data, data_files,
and MANIFEST.in to make it work; if they’re not lucky they just sit there
dumbfounded at it.
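
For concreteness, the usual fix ends up looking something like this (package and file names invented); MANIFEST.in controls what lands in the sdist:

    include README.rst
    recursive-include mypkg/data *.json

while package_data in setup() tells setuptools which of those files to actually install with the package:

    from setuptools import setup, find_packages

    setup(
        name="mypkg",
        version="0.1",
        packages=find_packages(),
        package_data={"mypkg": ["data/*.json"]},
    )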

Yeah, setuptools is kinda sucky this way. But this is fixable with
better build systems. And before we can get better build systems, we
need buy-in from devs. And saying "sorry, we're unilaterally screwing
up your recompile times because we don't care" is not a good way to
get there :-(


I don’t think it has anything to do with setuptools TBH, other than the fact that its interface for declaring what does and doesn’t go into an sdist is kind of crummy. This problem is going to exist as long as you have any mechanism that lets some files be excluded from an sdist.





Also also, notice elsewhere in the thread where Thomas notes that flit
can't build an sdist from an unpacked sdist. It seems like 'pip
install unpacked-sdist/' is an important use case to support…


If the build tool gives us a mechanism to determine whether something is an
unpacked sdist or not, so we can fall back to just copying in that case, that
is fine with me. The bad case is generally only going to be hit on VCS
checkouts or other non-sdist source trees.
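
One possible heuristic (nothing standardized, just an observation): sdists produced by distutils/setuptools ship a PKG-INFO file at the top level, while VCS checkouts normally don’t, so something like this could tell the two apart:

    import os

    def looks_like_unpacked_sdist(source_dir):
        # PKG-INFO is written into the sdist at build time; a plain VCS
        # checkout typically won't have it at the top level.
        return os.path.isfile(os.path.join(source_dir, "PKG-INFO"))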

I guess numpy could just claim that all VCS checkouts are actually
unpacked sdists…?

I mean, ``pip install .`` is still going to ``cp -r`` that VCS checkout into a temporary location if you do that, and guaranteeing the invariant that ``python setup.py build && pip install .`` doesn’t trigger a recompile isn’t something that I would want pip to start doing. So would it _work_ for this use case? Possibly? Is it supported? Nope; if it breaks you get to keep both pieces.



Donald Stufft