[Distutils] Second draft of a plan for a new source tree / sdist format

Nathaniel Smith njs at pobox.com
Wed Oct 28 05:08:10 EDT 2015

On Tue, Oct 27, 2015 at 3:43 AM, Paul Moore <p.f.moore at gmail.com> wrote:
> On 26 October 2015 at 06:04, Nathaniel Smith <njs at pobox.com> wrote:
> > Here's a second round of text towards making a build-system
> > independent interface between pip and source trees/sdists. My idea
> > this time is to take a divide-and-conquer approach: this text tries to
> > summarize all the stuff that it seemed like we had mostly reached
> > consensus on in the previous thread + call, with blank chunks marked
> > "TBD" where there are specific points that still need To Be
> > Determined. So my hope is that everyone will read what's here and
> > agree that it's great as far as it goes, and then we can go through
> > and fill in each missing piece one at a time.
> I'll comment on what's here, but ignore the TBD items - I'd rather (as
> you suggest) leave discussion of those details till the basic idea is
> agreed.
> > Abstract
> > ========
> >
> > Distutils delenda est.
> While this makes a nice tagline, I'd rather something less negative.
> Distutils does not "need" to be destroyed. It's perfectly adequate
> (although hardly user friendly) for a lot of cases - I'd be willing to
> suggest *most* users can work just fine with distutils.
> I'm not a fan of distutils, but I'd prefer it if we kept the rhetoric
> limited - as Nick pointed out this whole area is as much a political
> issue as a technical one.
> > Extended abstract
> > =================
> >
> > While ``distutils`` / ``setuptools`` have taken us a long way, they
> > suffer from three serious problems: (a) they're missing important
> > features like autoconfiguration and usable build-time dependency
> > declaration, (b) extending them is quirky, complicated, and fragile,
> > (c) it's very difficult to use anything else, because they provide the
> > standard interface for installing python packages expected by both
> > users and installation tools like ``pip``.
> Again, this is overstated. You very nearly lost me right here - people
> won't read the details of the proposal if they disagree with the
> abstract(s). Specifically:
> * The features in (a) are only important to *some* parts of the
> community. The scientific community is the major one, and is a huge
> influence over the direction we want to go in, but again, not crucial
> to many people. And even where they might be useful (e.g., Windows
> users building pyyaml, lxml, pillow, ...) the description implies
> "working out what's there" rather than "allowing users to easily
> manage non-Python dependencies", which gives the wrong impression.
> * The features in (b) are highly specialised. Very few people extend
> setuptools/distutils. And those who do, have often invested a lot of
> effort in doing so. Sure, they'd rather not have needed to, but now
> that they have, a replacement system simply means that work is lost.
> Arguably, fixing (b) is only useful for people (like the scientific
> community) who have needed to extend setuptools and have been unable
> to achieve their goals that way. That's an even smaller part of the
> community.
> > Previous efforts (e.g. distutils2 or setuptools itself) have attempted
> > to solve problems (a) and/or (b). We propose to solve (c).
> Agreed - this is a good approach. But it's at odds with your abstract,
> which says distutils must die. Here you're saying you want to allow
> people to keep using distutils but allow people with specialised needs
> to choose an alternative. Or are you offering an alternative to people
> who use distutils?
> The whole of the above is confusing on the face of it. The details
> below clarify a lot, as does knowing how the previous discussions have
> gone. But it would help a lot if the introduction to this PEP were
> clearer.

Fair enough, I'll dial it back. :-)

My personal prediction is that within a year of this support becoming
widespread, we'll see build systems that are just better than
distutils on all axes for all projects, not just the ones with weird
specialised needs -- AFAICT the distutils architecture has remained
basically unchanged since Python 2.0, and we've gained a bit more
experience with Python packaging in the last 15 years :-). But yeah,
sure, if you think it'll bother people then there's no point in that.

> > The goal of this PEP is to get distutils-sig out of the business of being
> > a gatekeeper for Python build systems. If you want to use distutils,
> > great; if you want to use something else, then that should be easy to
> > do using standardized methods. The difficulty of interfacing with
> > distutils means that there aren't many such systems right now, but to
> > give a sense of what we're thinking about see `flit
> > <https://github.com/takluyver/flit>`_ or `bento
> > <https://cournape.github.io/Bento/>`_. Fortunately, wheels have now
> > solved many of the hard problems here -- e.g. it's no longer necessary
> > that a build system also know about every possible installation
> > configuration -- so pretty much all we really need from a build system
> > is that it have some way to spit out standard-compliant wheels.
> OK. Although I see a risk here that if I want to build package FOO, I
> now have to worry whether FOO's build system supports Windows, as well
> as worrying whether FOO itself supports Windows.
> There's still a role for some "gatekeeper" (not a good word IMO, maybe
> "coordinator") to provide a certain level of support or review of
> build systems, and a point of contact for users with build issues (the
> point of this proposal is to some extent that people don't need to
> *know* what build system a project uses, so suggesting everyone has to
> direct issues to the correct build system support forum isn't
> necessarily practical).

I see what you mean, but I don't think there's much that can or should
be done about it in the form of a PEP?

I assume that what will happen is that if you can't build a package,
you'll file a bug with the maintainers of that package, and then it's
their job to figure out whether to patch around the issue locally,
file their own bug upstream with whatever build system package they're
using, switch to a new build system, or whatever. I think we can
generally trust individual projects and the community at large to
figure out what the trade-offs between different systems are, once the
different systems start existing. Though it may well make sense for
the PyPA packaging guide to add a set of best-practice guidelines for
build system implementors.

> > We therefore propose a new, relatively minimal interface for
> > installation tools like ``pip`` to interact with package source trees
> > and source distributions.
> >
> > In addition, we propose a wheel-inspired static metadata format for
> > sdists, suitable for tools like PyPI and pip's resolver.
> >
> >
> > Terminology and goals
> > =====================
> >
> > A *source tree* is something like a VCS checkout. We need a standard
> > interface for installing from this format, to support usages like
> > ``pip install some-directory/``.
> >
> > A *source distribution* is a static snapshot representing a particular
> > release of some source code, like ``lxml-3.4.4.zip``. Source
> > distributions serve many purposes: they form an archival record of
> > releases, they provide a stupid-simple de facto standard for tools
> > that want to ingest and process large corpora of code, possibly
> > written in many languages (e.g. code search), they act as the input to
> > downstream packaging systems like Debian/Fedora/Conda/..., and so
> > forth. In the Python ecosystem they additionally have a particularly
> > important role to play, because packaging tools like ``pip`` are able
> > to use source distributions to fulfill binary dependencies, e.g. if
> > there is a distribution ``foo.whl`` which declares a dependency on
> > ``bar``, then we need to support the case where ``pip install bar`` or
> > ``pip install foo`` automatically locates the sdist for ``bar``,
> > downloads it, builds it, and installs the resulting package.
> This is somewhat misleading, given that you go on to specify the
> format below, but maybe that's only an issue for someone like me who
> saw the previous debate over "source distribution" (as a bundled up
> source tree) vs "sdist" as a specified format. If I understand, you've
> now discarded the former sense of source distribution, and are
> sticking with the latter (specified format) definition.

The "sdists" in this draft try to compromise between the various
concepts that were proposed in the previous thread: you can generally
treat them like bundled up source trees (they have a single directory
that unpacks into something that's laid out similarly to a VCS
checkout), but they also contain additional static metadata to make
PyPI and pip happy (or at least, as much static metadata as they can).

> > Source distributions are also known as "sdists" for short.
> >
> >
> > Source trees
> > ============
> >
> > We retroactively declare the legacy source tree format involving
> > ``setup.py`` to be "version 0". We don't try to specify it further;
> > its de facto specification is encoded in the source code and
> > documentation of ``distutils``, ``setuptools``, ``pip``, and other
> > tools.
> >
> > A "version 1" (or greater) source tree is any directory which contains
> > a file named ``pypackage.cfg``, which will -- in some manner whose
> > details are TBD -- describe the package's build dependencies and how
> > to invoke the build system. This mechanism:
> >
> > - Will allow for both static and dynamic specification of build dependencies
> >
> > - Will have some degree of isolation of different builds from each
> > other, so that it will be possible for a single run of pip to install
> > one package that build-depends on ``foo = 1.1`` and another package
> > that build-depends on ``foo = 1.2``.
> All good so far.
> > - Will leave the actual installation of the package in the hands of
> > the build/installation tool (i.e. individual package build systems
> > will not need to know about things like --user versus --global or make
> > decisions about when and how to modify .pth files)
> This seems completely backwards to me. It's pip's job to do the actual
> install. The build tool should *only* focus on generating standard
> conforming binary wheels - otherwise what's the point of the
> separation of concerns that wheels provide?
> Or maybe I'm confused by the term "build/installation tool" - by that
> did you actually mean pip, rather than the build system?

Yeah, I was just unclear here -- the "build/installation tool" was
supposed to be pip (because pip installs packages! ...and also builds
them), as contrasted with the "individual package build systems" which
don't know anything about installing. I'll reword.

This bullet point is rather substantive, actually, since if adopted
then it rules out the proposed semantics for the "develop" operation
in Robert's PEP. (In current pip and in his proposal, "pip install -e"
is unlike regular "pip install", in that "pip install -e" doesn't
actually install anything, it just calls "setup.py develop", which
does the actual installation. One consequence of this AFAICT is that
if you try passing any of the standard installation target options to
"pip install -e", like "--target" or whatever, then it blows up...)

> (TBDs omitted)
> > Source distributions
> > ====================
> >
> > [possibly this should get split off into a separate PEP, but I'll keep
> > it together for now for ease of discussion]
> >
> > A "version 1" (or greater) source distribution is a file meeting the
> > following criteria:
> >
> > - It MUST have a name of the form: {PACKAGE}-{VERSION}.{EXT}, where
> > {PACKAGE} is the package name, {VERSION} is a PEP 440-compliant
> > version number, and {EXT} is a compliant archive format.
> >
> >   The set of compliant archive formats is: zip, [TBD]
> >
> >   [QUESTION: should we continue to allow .tar.gz and friends? In
> > practice by "allow" I mean something like "accept new-style sdists on
> > PyPI in this format". I'm inclined not to -- zip is the most
> > universally supported format around, it allows file-based random
> > access (unlike tar-based things) which is useful for pulling out
> > metadata without decompressing the whole thing, and standardizing on
> > one format dodges distracting and pointless discussions about which
> > format to use, i.e. it's TOOWTDI-compliant. Of course pip is free to
> > continue to support other archive formats when passed explicitly on
> > the command line. Any objections?]
> +1 on having a single archive format, and zip seems like the best choice.
> >   Similar to wheels, the archive is Unicode, and the filenames inside
> > the archive are encoded in UTF-8.
> This isn't the job of the sdist format to specify. It should be
> implicit in the choice of archive format.

There's a silly typo in the quoted line -- it was supposed to read:

Similar to wheels, the archive *filename* is Unicode, and the
filenames inside the archive are encoded in UTF-8.

These two points were just lifted from PEP 427 without thinking about
it too much -- see https://www.python.org/dev/peps/pep-0427/#id12

Now that I reread that section of PEP 427, the underscore replacement
probably makes sense for sdists as well.
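
As an illustration of the naming rule, here is a minimal sketch of how a
tool might build and split a {PACKAGE}-{VERSION}.{EXT} filename if the
PEP 427 underscore replacement were applied to sdists too. The function
names are hypothetical, and the sketch assumes zip is the only compliant
extension, which is still an open question in the draft.

```python
import re


def escape_component(value):
    # PEP 427's escaping rule for wheel filename components: every run of
    # characters other than alphanumerics, "_" and "." becomes one "_".
    return re.sub(r"[^\w\d.]+", "_", value, flags=re.UNICODE)


def sdist_filename(package, version, ext="zip"):
    # Hypothetical helper: assemble {PACKAGE}-{VERSION}.{EXT}.
    return "{}-{}.{}".format(
        escape_component(package), escape_component(version), ext
    )


def split_sdist_filename(filename):
    # Because "-" is escaped inside each component, the first "-" in the
    # stem is always the separator between package and version.
    stem, _, ext = filename.rpartition(".")
    package, sep, version = stem.partition("-")
    if ext != "zip" or not sep or not package or not version:
        raise ValueError("not a new-style sdist filename: %r" % filename)
    return package, version, ext


print(sdist_filename("my-package", "1.0"))
print(split_sdist_filename("lxml-3.4.4.zip"))
```

Escaping "-" inside the components is what makes the split unambiguous;
without it a name like "my-package-1.0.zip" could be read two ways.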

> Having said that, I'd go with
> 1. The sdist filename MUST support the full range of package names as
> specified in PEP 426 (https://www.python.org/dev/peps/pep-0426/#name)
> and versions as in PEP 440
> (https://www.python.org/dev/peps/pep-0440/). That's actually far less
> than full Unicode.
> 2. The archive format MUST support arbitrary Unicode filenames. That
> means zip is OK, but tar.gz isn't unless you specify UTF-8 is used
> (the tar format doesn't allow for an encoding declaration - see
> https://docs.python.org/3.5/library/tarfile.html#tar-unicode for
> details on Unicode issues in the tar format).
> Having said that I'd also go with "filenames in the archive SHOULD be
> limited to ASCII" - because we have had issues with pip where test
> files have Unicode filenames, and builds break because they get
> mangled on systems with weird encoding setups... IIRC, these are
> typically related to .tar.gz sdists, which (due to the lack of
> encoding support) result in files being unpacked with the wrong names.
> So maybe if we enforce zip format we don't need to add this
> limitation.

Especially if we go with zip as the one true archive format, then I
think we should just use the same rules for all this stuff as wheels
do. No need to re-invent the... well, you know.
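
To make the random-access argument for zip concrete, here is a small
sketch that pulls METADATA out of an archive without unpacking anything
else. It assumes the ``.sdist-info`` directory sits inside the top-level
``{PACKAGE}-{VERSION}`` directory, which is one reading of the draft
rather than settled layout; the helper names and the demo package are
made up for the example.

```python
import io
import zipfile


def read_sdist_metadata(fileobj, package, version):
    # Assumed layout: {PACKAGE}-{VERSION}/{PACKAGE}-{VERSION}.sdist-info/METADATA
    member = "{0}-{1}/{0}-{1}.sdist-info/METADATA".format(package, version)
    with zipfile.ZipFile(fileobj) as archive:
        # The zip central directory lets us read this one member directly.
        return archive.read(member).decode("utf-8")


def build_demo_sdist():
    # Build a tiny in-memory archive so the example runs on its own.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "demo-1.0/demo-1.0.sdist-info/METADATA",
            "Metadata-Version: 1.1\nName: demo\nVersion: 1.0\n",
        )
        archive.writestr("demo-1.0/demo.py", "print('hello')\n")
    buf.seek(0)
    return buf


print(read_sdist_metadata(build_demo_sdist(), "demo", "1.0"))
```

With a tar-based archive the same lookup generally means scanning (and
for .tar.gz, decompressing) the stream until the member turns up.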

> > - When unpacked, it MUST contain a single directory tree
> > named ``{PACKAGE}-{VERSION}``.
> >
> > - This directory tree MUST be a valid version 1 (or greater) source
> > tree as defined above.
> >
> > - It MUST additionally contain a directory named
> > ``{PACKAGE}-{VERSION}.sdist-info`` (notice the ``s``), with the
> > following contents:
> >
> >   - ``SDIST``: Mandatory. Same record-oriented format as a wheel's
> > ``WHEEL`` file, but with different fields::
> >
> >       SDist-Version: 1.0
> >       Generator: setuptools sdist 20.1
> >
> >     ``SDist-Version`` is the version number of this specification.
> > Software that processes sdists should warn if ``SDist-Version`` is
> > greater than the version it supports, and must fail if
> > ``SDist-Version`` has a greater major version than the version it
> > supports.
> >
> >     ``Generator`` is the name and optionally the version of the
> > software that produced the archive.
> >
> >   - ``RECORD``: Mandatory. A list of all files contained in the sdist
> > (except for the RECORD file itself and any signature files) together
> > with their hashes, as specified in PEP 427.
> >
> >   - ``RECORD.jws``, ``RECORD.p7s``: Optional. Signature files as
> > specified in PEP 427.
> >
> >   - ``METADATA``: Mandatory. Metadata version 1.1 or greater format
> > metadata, with an additional rule that fields may contain the special
> > sentinel value ``__SDIST_DYNAMIC__``, which indicates that the value
> > of this field cannot be determined until build time. If a "multiple
> > use field" is present with the value ``__SDIST_DYNAMIC__``, then this
> > field MUST occur exactly once, e.g.::
> >
> >        # Okay:
> >        Requires-Dist: lxml (> 3.3)
> >        Requires-Dist: requests
> >
> >        # no Requires-Dist lines at all is okay
> >        # (meaning: this package's requirements are the empty set)
> >
> >        # Okay, requirements will be determined at build time:
> >        Requires-Dist: __SDIST_DYNAMIC__
> >
> >        # NOT okay:
> >        Requires-Dist: lxml (> 3.3)
> >        Requires-Dist: __SDIST_DYNAMIC__
> >
> >     (The use of a special token allows us to distinguish between
> > multiple use fields whose value is statically the empty list versus
> > one whose value is dynamic; it also allows us to distinguish between
> > optional fields which are statically not present versus ones whose
> > value is dynamic.)
> >
> >     When this sdist is built, the resulting wheel MUST have metadata
> > which is identical to the metadata present in this file, except that
> > any fields with value ``__SDIST_DYNAMIC__`` in the sdist may have
> > arbitrary values in the wheel.
> >
> >     A valid sdist MUST NOT use the ``__SDIST_DYNAMIC__`` mechanism for
> > the package name or version (i.e., these must be given statically),
> > and these MUST match the {PACKAGE} and {VERSION} of the sdist as
> > described above.
> This seems pretty good at first reading.
> >     [TBD: do we want to forbid the use of dynamic metadata for any
> > other fields? I assume PyPI will enforce some stricter rules at least,
> > but I don't know if we want to make that part of the spec, or just
> > part of PyPI's administrative rules.]
> This covers the main point of contention. It would be bad if build
> systems started using __SDIST_DYNAMIC__ just because "it's easier".
> Maybe add
> * A valid sdist SHOULD NOT use the __SDIST_DYNAMIC__ mechanism any
> more than necessary (i.e., if the metadata is the same in all
> generated wheels, it does not need to use the __SDIST_DYNAMIC__
> mechanism, and so should not do so).
> > This is intentionally a close analogue of a wheel's ``.dist-info``
> > directory; the intention is that as future metadata standards are defined,
> > the specifications for the ``.sdist-info`` and ``.dist-info``
> > directories will evolve in synchrony.
> >
> >
> > Evolutionary notes
> > ==================
> >
> > A goal here is to make it as simple as possible to convert old-style
> > sdists to new-style sdists. (E.g., this is one motivation for
> > supporting dynamic build requirements.) The ideal would be that there
> > would be a single static pypackage.cfg that could be dropped into any
> > "version 0" VCS checkout to convert it to the new shiny. This is
> > probably not 100% possible, but we can get close, and it's important
> > to keep track of how close we are... hence this section.
> >
> > A rough plan would be: Create a build system package
> > (``setuptools_pypackage`` or whatever) that knows how to speak
> > whatever hook language we come up with, and convert them into
> > setuptools calls. This will probably require some sort of hooking or
> > monkeypatching to setuptools to provide a way to extract the
> > ``setup_requires=`` argument when needed, and to provide a new version
> > of the sdist command that generates the new-style format. This all
> > seems doable and sufficient for a large proportion of packages (though
> > obviously we'll want to prototype such a system before we finalize
> > anything here). (Alternatively, these changes could be made to
> > setuptools itself rather than going into a separate package.)
> >
> > But there remain two obstacles that mean we probably won't be able to
> > automatically upgrade packages to the new format:
> >
> > 1) There currently exist packages which insist on particular packages
> > being available in their environment before setup.py is executed. This
> > means that if we decide to execute build scripts in an isolated
> > virtualenv-like environment, then projects will need to check whether
> > they do this, and if so then when upgrading to the new system they
> > will have to start explicitly declaring these dependencies (either via
> > ``setup_requires=`` or via static declaration in ``pypackage.cfg``).
> >
> > 2) There currently exist packages which do not declare consistent
> > metadata (e.g. ``egg_info`` and ``bdist_wheel`` might get different
> > ``install_requires=``). When upgrading to the new system, projects
> > will have to evaluate whether this applies to them, and if so they
> > will need to either stop doing that, or else add ``__SDIST_DYNAMIC__``
> > annotations at appropriate places.
> >
> >    We'll also presumably need some API for packages to describe which
> > parts of the METADATA file should be marked ``__SDIST_DYNAMIC__``, for
> > the packages that need it (a new argument to ``setup()`` or some
> > setting in ``setup.cfg`` or something).
> I'm confused here. And it's just now become clear *why* I'm confused.
> The sdist format MUST be a generated format - i.e., we should insist
> (in principle at least) that it's only ever generated by tools.
> Otherwise it's way too easy for people to just zip up their source
> tree, hand craft something generic (that over-uses __SDIST_DYNAMIC__)
> and say "here's an sdist". Obviously, people always *can* manually
> create an sdist but we need to pin down the spec tightly, or we've not
> improved things.

The mandatory RECORD file makes it pretty much impossible to generate
an sdist manually.
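
For a sense of what generating RECORD involves, here is a minimal sketch
of PEP 427-style entries: path, ``sha256=`` plus the urlsafe base64
digest with the padding stripped, and the size in bytes. The function
names are hypothetical, and the sketch skips the CSV quoting that a real
tool would need for paths containing commas.

```python
import base64
import hashlib


def record_line(path, data):
    # PEP 427 hash encoding: urlsafe base64 of the digest, no "=" padding.
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "{},sha256={},{}".format(path, encoded, len(data))


def build_record(files):
    # `files` maps archive paths to their bytes. RECORD itself and any
    # signature files are left out, as the draft specifies.
    return "\n".join(record_line(p, files[p]) for p in sorted(files)) + "\n"


demo_files = {
    "demo-1.0/demo.py": b"print('hello')\n",
    "demo-1.0/pkg/__init__.py": b"",
}
print(build_record(demo_files), end="")
```

Nobody is going to compute these digests by hand for every file, which
is the sense in which RECORD pushes sdist creation towards tools.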

> That's why I'm concerned about __SDIST_DYNAMIC__ and it's also what
> confuses me about the above transition plan.
> For people using setuptools currently, the transition should be simply
> that they upgrade setuptools, and the "setup.py sdist" command in the
> new setuptools generates the new sdist format. By default, the
> setuptools sdist process assumes everything is static and requires the
> user to modify the setup.py to explicitly mark which metadata they
> want to be left to build time. That way, we get a relatively
> transparent transition, while avoiding overuse of dynamic metadata.

My assumption was that when a project flips the switch to move to the
new format (not sure what that switch looks like, but presumably we
will have one), then one of the things that happens is that "setup.py
sdist" starts running the equivalent of "egg_info" and stuffing all
the resulting metadata into {PACKAGE}-{VERSION}.sdist-info/ (along
with generating a RECORD file etc.). So the default would be to assume
all metadata is static. But right now that is not actually true for
all projects (for both good and bad reasons), so this means that
before they flip that switch they need to either adjust their setup.py
to make it true, or else they need to use some new API that setuptools
will add to let them specify which fields should be marked as dynamic.

This API will be purely a setuptools-internal thing, though, nothing
that the PEP itself needs to concern itself with.
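
A project deciding whether it can treat all its metadata as static could
run a check along these lines: compare the sdist METADATA with the built
wheel's METADATA and report any field that differs without being marked
dynamic. This is only a sketch; the function names are hypothetical, and
treating multiple-use fields as unordered is an assumption, since the
draft just says the metadata must be "identical".

```python
from email.parser import Parser

SDIST_DYNAMIC = "__SDIST_DYNAMIC__"


def _fields(text):
    # METADATA uses RFC 822-style headers, so the email parser can read it.
    fields = {}
    for key, value in Parser().parsestr(text).items():
        fields.setdefault(key.lower(), []).append(value.strip())
    return fields


def inconsistent_fields(sdist_metadata, wheel_metadata):
    sdist = _fields(sdist_metadata)
    wheel = _fields(wheel_metadata)
    mismatched = []
    for name in sorted(set(sdist) | set(wheel)):
        static_values = sdist.get(name, [])
        if static_values == [SDIST_DYNAMIC]:
            continue  # dynamic in the sdist: the wheel may hold anything
        if sorted(static_values) != sorted(wheel.get(name, [])):
            mismatched.append(name)
    return mismatched


sdist_text = "Name: demo\nVersion: 1.0\nRequires-Dist: requests\n"
wheel_text = "Name: demo\nVersion: 1.0\nRequires-Dist: lxml (> 3.3)\n"
print(inconsistent_fields(sdist_text, wheel_text))
```

A non-empty result means the project either has to make that field
static in practice or mark it as dynamic before switching formats.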

> If setup.py has to explicitly mark dynamic metadata, that also allows
> us to reject attempts to make name and version dynamic. Which is good.

Presumably PyPI will also reject packages with dynamic names or
versions, so any build system that tries to get away with this will
quickly realize the error of their ways.
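
The kind of check an index or installer could apply here is small. The
sketch below rejects a dynamic (or missing) name or version and enforces
the draft's rule that a multiple-use field set to ``__SDIST_DYNAMIC__``
occurs exactly once. The function name and error messages are made up
for illustration; nothing here is an existing PyPI or pip API.

```python
from email.parser import Parser

SDIST_DYNAMIC = "__SDIST_DYNAMIC__"


def validate_sdist_metadata(text):
    msg = Parser().parsestr(text)
    errors = []
    # Name and Version must be given statically.
    for field in ("Name", "Version"):
        value = msg.get(field)
        if value is None:
            errors.append("%s is missing" % field)
        elif value.strip() == SDIST_DYNAMIC:
            errors.append("%s must not be dynamic" % field)
    # A field marked dynamic must not also carry static values.
    for field in sorted({key.lower() for key in msg.keys()}):
        values = [v.strip() for v in msg.get_all(field)]
        if SDIST_DYNAMIC in values and len(values) > 1:
            errors.append("%s mixes static and dynamic values" % field)
    return errors


bad = (
    "Metadata-Version: 1.1\n"
    "Name: demo\n"
    "Version: __SDIST_DYNAMIC__\n"
    "Requires-Dist: lxml (> 3.3)\n"
    "Requires-Dist: __SDIST_DYNAMIC__\n"
)
for problem in validate_sdist_metadata(bad):
    print(problem)
```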


Nathaniel J. Smith -- http://vorpus.org
