metadata in sdists (was Re: Second draft of a plan for a new source tree / sdist format)
On Wed, Oct 28, 2015 at 5:32 AM, Daniel Holth <dholth@gmail.com> wrote:
Nathaniel,
I'm not sure what the software is supposed to do with fine-grained dynamic metadata that would make much sense to the end user. I think you could probably get away with a single flag Dynamic: true / false. Iff true, pip runs the dist-info command after installing bootstrap dependencies. You could still complain if the name & version changed. Of course in a VCS checkout or during development you probably always want the regenerate-metadata behavior.
So, right, the draft I just posted proposes that sdists should contain static wheel-style metadata where possible, with a fine-grained mechanism for marking specific parts as being dynamically generated at build time. The different motivations that I'm trying to balance here are:

- Donald wants the fields that PyPI cares about (name, version, summary, long_description, author, homepage, trove classifiers, etc.) to *always* be statically present in sdists, even if there is some other metadata that can't be made static (i.e., install-requirements).

- Robert wants the (name, version, install-requirements) information to be available statically as often as possible, because that's the information that the resolver needs to know when it's considering trying to install some (name, version) pair, and he wants this information to be cheap to access because the resolver may have to backtrack and consider many different versions of the same package. Installing build-requirements and then running egg-info/dist-info is already pretty expensive, so it would be good if install-requirements could be static in the 99% of cases where they are known.

- I'm still holding out some small hope of killing off the "egg-info"/"dist-info" step entirely, because any time you have two different operations and it's a bug if they can get out of sync (in this case: the dist-info step, and then the actual wheel building step), then maintaining and testing is a hassle and eventually that bug will happen. Basically it's a violation of DRY -- you have two sources of the same data and they're both supposed to be authoritative. And we know from experience with egg_info that there will be strong temptations for people to cheat, which creates all kinds of headaches.

So how can we balance these different design goals? Obviously a new sdist format should provide static metadata for the fields PyPI cares about -- we basically require this already for sdists, we just encode it in a weird out-of-band way during the PyPI upload instead of recording it authoritatively in the sdist itself. Then that leaves a 2x2 space of plausible design options: we can either support dist-info as a separate operation from building, or not; and we can make a best effort to provide static install-requirements when they're available, or we can not bother and just never provide static install-requirements.

None of this matters for 'pip install <source directory>'. None of this matters for installations using wheels. None of this matters for installations where the requirements are straightforward to satisfy (e.g. cases where the current pseudo-resolver works okay). None of this affects correctness -- it's purely an optimization. But maybe it's an important optimization in certain specific cases.

The case where this matters is like: suppose you just did 'pip install scikit-learn', and scikit-learn requires scipy and numpy. And scipy also requires numpy, so our dependency graph forms a triangle. And let's say that scikit-learn is happy with any version of scipy, but scikit-learn and scipy both have versioned requirements on numpy. So for example, scikit-learn might require numpy (< 1.9). Meanwhile, the latest version of scipy requires numpy (>= 1.9)... so there's no version of numpy that satisfies both at once, which means we can't use this version of scipy. So you have to consider the next-to-latest version of scipy. But then it turns out that it also requires numpy (>= 1.9).
So then you have to consider the next-to-next-to-latest version of scipy, which turns out to require numpy (>= 1.8), so phew, we can use numpy 1.8 to satisfy both of these at once and we're done. And, to make this maximally annoying, it turns out that there is no scipy wheel for your platform, so all this checking of scipy's install-requires has to be done based on whatever information is inside the scipy sdist. And finally, the reason we use scipy in this example is that building scipy turns out to be really slow (~8 minutes on my laptop), definitely an outlier among python packages.

There are three ways this can play out:

Option 1, "fast": if the scipy sdists have static install-requires metadata, then this is pretty trivial. You might have to download the sdists for the first two non-working versions, but once you have the files you can immediately tell from the metadata that they aren't going to be usable and move on. (Of course, scipy itself probably *can't* have static install-requires metadata, but most packages can.)

Option 2, "slow": if the scipy sdists don't have static install-requires metadata, but do support the dist-info operation, then you do something like:
- download and unpack scipy-XX.zip
- discover that it build-requires numpy. If there's no scipy wheel for your platform, there's presumably no numpy wheel either, so we'll have to build it.
- download and build numpy, and install your newly built numpy into scipy's build environment (~2 minutes?)
- run scipy's dist-info (a few seconds?)
- go back to the top and repeat for the next candidate

Option 3, "slowest": if you don't have static install-requires *and* you don't have the dist-info operation, then it's identical to the "slow" option, except instead of dist-info you have to actually run the full scipy build at each step, so it takes ~10 minutes per cycle instead of ~2 minutes per cycle (or maybe a bit less depending on whether the build-requirements work out so that you can re-use the same build of numpy for both attempts, etc.).

Now, our four design options break down like this:

- no dist-info / never record install-requirements in sdist: 100% of sdists are on the "slowest" path
- no dist-info / record install-requirements in sdist when available: 95% of sdists are on the "fast" path, 5% (the ones with dynamic install-requirements, like scipy) are on the "slowest" path
- dist-info / never record install-requirements in sdist (= current sdist format): 100% of sdists are on the "slow" path
- dist-info / record install-requirements in sdist when available: 95% of sdists are on the "fast" path, 5% are on the "slow" path

*** Summary ***

There are two different ways to speed up resolution of dependencies involving sdists. Static install-requirements are super fast when they work, and their complexity cost is low but non-zero. (We know we want to include lots of other metadata in sdists, and we already have standard ways to represent lots of metadata + install-requirements, so the only new thing is the __SDIST_DYNAMIC__ bit.) dist-info is slower and always works, but has a higher complexity cost. (Every build backend needs to implement this extra operation, and it's a bit fragile.)

If you think that:
- automagic building from sdists is an important case to support seamlessly and will remain so indefinitely, and
- these kinds of complicated conflicts are common, and
- it's important to optimize them to be resolved as fast as possible,
then you'll want dist-info + static install-requirements. (IIUC this is Robert's position.)
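(To make the "fast" versus "slow"/"slowest" distinction above concrete, here is a toy sketch of the resolver loop in the scipy story. Everything in it -- the candidate data, the helper names -- is invented for illustration and has nothing to do with pip's actual internals.)

    # Toy illustration only: with static install-requires, rejecting a
    # candidate costs one metadata read; without it, each rejection costs a
    # build environment (minutes) before the resolver can even look.

    # Hypothetical index of scipy sdist candidates:
    #   version -> (install_requires, metadata_is_static)
    SCIPY_CANDIDATES = {
        "0.16.1": (["numpy >= 1.9"], True),
        "0.16.0": (["numpy >= 1.9"], True),
        "0.15.1": (["numpy >= 1.8"], True),
    }

    def expensive_dist_info(version):
        # Stand-in for the "slow"/"slowest" paths: set up a build
        # environment, build numpy into it, run dist-info or a full build.
        raise NotImplementedError("~2-10 minutes per candidate")

    def compatible_with_sklearn(req):
        # scikit-learn (in this story) pins numpy < 1.9, so any scipy that
        # insists on numpy >= 1.9 can't be co-installed with it.
        return req.replace(" ", "") != "numpy>=1.9"

    def pick_scipy_version():
        for version in sorted(SCIPY_CANDIDATES, reverse=True):
            requires, is_static = SCIPY_CANDIDATES[version]
            if not is_static:
                requires = expensive_dist_info(version)
            if all(compatible_with_sklearn(r) for r in requires):
                return version  # "0.15.1" in this toy example
        return None

The is_static branch is the whole story: when it's false, the only way to fill in requires is to pay for a build environment first.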
OTOH if you think that:

- automagic building of sdists is a mixed blessing at best (at least in this particular example, what mostly actually happens is that people get angry and swear at their computer because they didn't want 'pip install scikit-learn' to go rebuild scipy and numpy -- argh, why is my laptop suddenly swapping to death / argh, I don't actually have a functional toolchain installed because I'm on Windows / argh, the resulting builds will seem to work at first but in fact be horribly slow because my BLAS is in a funny place where it won't be autodetected and so they'll fall back on unoptimized routines / ... see also Paul's requests to just turn off sdist building entirely), and

- by the time these new formats are implemented and in wide use in a few years, the increased maturity of the wheel ecosystem + Linux wheels on PyPI + Donald's planned automated build servers will mean that automagic building of sdists will only rarely be needed in practice, and

- these kinds of pathological requirements conflicts are going to be pretty rare, and anyway when they do arise you don't care whether it takes 60 minutes or merely 20 to sort out scikit-learn's dependencies because you're probably going to hit Control-C either way...

then you might prefer the no-dist-info + no-static-install-requirements approach, because it simplifies the build system / packaging stuff :-). (That was the reasoning behind my original proposal.)

And if you like that second argument, but have gotten push-back from people like Robert, then you might try writing up a proposal for static metadata in sdists, in the hopes that no-dist-info / yes-static-metadata might be an acceptable compromise between the above two positions :-).

This discussion is structurally important because it's where the design of a new build system interface becomes coupled to the design of a new sdist format: if you want to get a new build system interface done as soon as possible and don't want to touch the static metadata stuff, then the choice is between "no dist-info / no static metadata" versus "yes dist-info / no static metadata", and if these are your options then it pushes you strongly towards supporting a dist-info operation. But then you risk saddling all build systems with the requirement to implement this dist-info thing, even if it later turns out that static sdist metadata makes it unnecessary.

As mentioned in the other thread about promoting extras to first-class packages, there is still some hope that we might be able to get away with having full static metadata in all sdists, in which case the dist-info operation becomes completely vestigial -- but that discussion is going to be much more involved and certainly requires the addition of a real resolver for pip. So we probably don't want to block all progress on a better build system while waiting for that.

One possible compromise to streamline things might be:

- give up on specifying a new sdist format for now (sorry Donald!) -- it's a good idea, we have some idea how to do it, but treat it as a separate project
- include a dist-info operation in the new standard build system interface...
- ...but make it optional.
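To illustrate what "optional" could look like from pip's side, here is a minimal sketch of a capability check. This is hypothetical -- it assumes the build backend is importable as a Python module and that the hook is called dist_info, neither of which is settled by any existing spec:

    import importlib

    def read_build_backend_name(source_directory):
        # Hypothetical: however the eventual source-tree spec names the
        # build backend, read that name here. Left unimplemented on purpose.
        raise NotImplementedError

    def supports_dist_info(source_directory):
        # Import the declared backend and treat the mere presence of a
        # dist_info hook as "this backend supports the optional operation".
        backend_name = read_build_backend_name(source_directory)
        backend = importlib.import_module(backend_name)
        return hasattr(backend, "dist_info")

If the hook is missing, pip would simply fall through to building a wheel, as in the pseudo-code below.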
The cost of making it optional is that pip would need to be prepared with a fallback if the dist-info operation is not available -- so the get-install-requirements step would look like:

    def get_install_requirements(source_directory):
        if supports_dist_info(source_directory):
            dist_info_directory = run_dist_info_hook(source_directory)
            return dist_info_to_install_requirements(dist_info_directory)
        else:
            wheel = run_wheel_build_hook(source_directory)
            return wheel_to_install_requirements(wheel)

Compared to a design where dist-info is mandatory, this is some extra complexity added to pip, but not a huge amount -- the above is pseudo-code, but aside from supports_dist_info all the functions it calls are ones that need to exist anyway, and the extra logic is neatly encapsulated within get_install_requirements.

The minor advantage of doing this is that if some project/build system doesn't need this logic they don't need to implement it (e.g. for flit, which only supports pure python packages, there's absolutely no use -- it can build a wheel as fast as it can generate just the dist-info, and it doesn't even matter because projects using flit will ~always have wheels available and the resolver will ~never need to look at an sdist).

The major advantage is that if later on it turns out that dist-info is useless (e.g. because sdists start shipping with full static metadata), then making it optional now means that we'll have the option of just dropping the support from build systems and from pip without breaking backcompat. (Compare to the situation where we make it mandatory now: since there will be versions of pip in the wild that blow up if the operation is not provided, every build system may be stuck providing some sort of dist-info support for a decade or whatever, even if it's not needed.)

Alternatively we could just leave dist-info out for now, and implement it later as an optional optimization that a build system can provide if it turns out to matter :-). This would definitely help move things forward more quickly, because I wouldn't be surprised if the most complicated part of implementing Robert's current spec turned out to be the code needed to validate that dist-info and build-wheel actually return matching metadata :-). You absolutely need to do this by default, because otherwise we'll be back where we are now with people intentionally playing horrible fragile tricks, and weird accidental inconsistencies going undetected and creating subtle bugs. But to implement this checking you need some complex flow control to make sure that the eventual build-wheel step has access to the original metadata, plus the code for checking metadata-equality might be non-trivial...

-n

--
Nathaniel J. Smith -- http://vorpus.org
(Hugely trimmed, because I couldn't find an easy way to pick out the important bits of context, sorry!)

On 29 October 2015 at 23:23, Nathaniel Smith <njs@pobox.com> wrote:
None of this affects correctness -- it's purely an optimization. But maybe it's an important optimization in certain specific cases.
One concern I have is that it's *not* just an optimisation in some cases. If a build being used to get metadata fails, what happens then?

If you fail the whole install process, then (using your scikit-learn case) suppose there are wheels available for older versions of scipy, but none for the latest version -- a very common scenario, in my experience, for a period after a new release appears. The dependency resolution tries to build the latest version to get its metadata, fails, and things stop. But the older version is actually fine, because the wheel can be used.

You could treat build failures as "assume not suitable", but that could result in someone getting an older version when a compile fails, rather than getting the error -- which, in less complex cases than the above, they might want to see so they can fix it, e.g. by setting an environment variable they'd forgotten, or by downloading a wheel from a non-PyPI repository like Christoph Gohlke's.

So while I follow your explanation for the cases where builds always succeed but might take forever, I'm not so sure your conclusions are right for a mix of wheels for some versions, failing builds, and other partially-working scenarios. This case concerns me far more in practice than complex dependency graphs do.

Paul