[Distutils] Second draft of a plan for a new source tree / sdist format
Paul Moore
p.f.moore at gmail.com
Tue Oct 27 06:43:01 EDT 2015
On 26 October 2015 at 06:04, Nathaniel Smith <njs at pobox.com> wrote:
> Here's a second round of text towards making a build-system
> independent interface between pip and source trees/sdists. My idea
> this time is to take a divide-and-conquer approach: this text tries to
> summarize all the stuff that it seemed like we had mostly reached
> consensus on in the previous thread + call, with blank chunks marked
> "TBD" where there are specific points that still need To Be
> Determined. So my hope is that everyone will read what's here and
> agree that it's great as far as it goes, and then we can go through
> and fill in each missing piece one at a time.
I'll comment on what's here, but ignore the TBD items - I'd rather (as
you suggest) leave discussion of those details till the basic idea is
agreed.
> Abstract
> ========
>
> Distutils delenda est.
While this makes a nice tagline, I'd rather have something less negative.
Distutils does not "need" to be destroyed. It's perfectly adequate
(although hardly user friendly) for a lot of cases - I'd be willing to
suggest *most* users can work just fine with distutils.
I'm not a fan of distutils, but I'd prefer it if we kept the rhetoric
limited - as Nick pointed out this whole area is as much a political
issue as a technical one.
> Extended abstract
> =================
>
> While ``distutils`` / ``setuptools`` have taken us a long way, they
> suffer from three serious problems: (a) they're missing important
> features like autoconfiguration and usable build-time dependency
> declaration, (b) extending them is quirky, complicated, and fragile,
> (c) it's very difficult to use anything else, because they provide the
> standard interface for installing python packages expected by both
> users and installation tools like ``pip``.
Again, this is overstated. You very nearly lost me right here - people
won't read the details of the proposal if they disagree with the
abstract(s). Specifically:
* The features in (a) are only important to *some* parts of the
community. The scientific community is the major one, and it is a huge
influence over the direction we want to go in, but these features are
not crucial to many other people. And even where they might be useful
(e.g., Windows users building pyyaml, lxml, pillow, ...) the
description implies "working out what's there" rather than "allowing
users to easily manage non-Python dependencies", which gives the wrong
impression.
* The features in (b) are highly specialised. Very few people extend
setuptools/distutils. And those who do, have often invested a lot of
effort in doing so. Sure, they'd rather not have needed to, but now
that they have, a replacement system simply means that work is lost.
Arguably, fixing (b) is only useful for people (like the scientific
community) who have needed to extend setuptools and have been unable
to achieve their goals that way. That's an even smaller part of the
community.
> Previous efforts (e.g. distutils2 or setuptools itself) have attempted
> to solve problems (a) and/or (b). We propose to solve (c).
Agreed - this is a good approach. But it's at odds with your abstract,
which says distutils must die. Here you're saying you want to allow
people to keep using distutils but allow people with specialised needs
to choose an alternative. Or are you offering an alternative to people
who use distutils?
The whole of the above is confusing on the face of it. The details
below clarify a lot, as does knowing how the previous discussions have
gone. But it would help a lot if the introduction to this PEP were
clearer.
> The goal of this PEP is to get distutils-sig out of the business of being
> a gatekeeper for Python build systems. If you want to use distutils,
> great; if you want to use something else, then that should be easy to
> do using standardized methods. The difficulty of interfacing with
> distutils means that there aren't many such systems right now, but to
> give a sense of what we're thinking about see `flit
> <https://github.com/takluyver/flit>`_ or `bento
> <https://cournape.github.io/Bento/>`_. Fortunately, wheels have now
> solved many of the hard problems here -- e.g. it's no longer necessary
> that a build system also know about every possible installation
> configuration -- so pretty much all we really need from a build system
> is that it have some way to spit out standard-compliant wheels.
OK. Although I see a risk here that if I want to build package FOO, I
now have to worry whether FOO's build system supports Windows, as well
as worrying whether FOO itself supports Windows.
There's still a role for some "gatekeeper" (not a good word IMO, maybe
"coordinator") to provide a certain level of support or review of
build systems, and a point of contact for users with build issues (the
point of this proposal is to some extent that people don't need to
*know* what build system a project uses, so suggesting everyone has to
direct issues to the correct build system support forum isn't
necessarily practical).
> We therefore propose a new, relatively minimal interface for
> installation tools like ``pip`` to interact with package source trees
> and source distributions.
>
> In addition, we propose a wheel-inspired static metadata format for
> sdists, suitable for tools like PyPI and pip's resolver.
>
>
> Terminology and goals
> =====================
>
> A *source tree* is something like a VCS checkout. We need a standard
> interface for installing from this format, to support usages like
> ``pip install some-directory/``.
>
> A *source distribution* is a static snapshot representing a particular
> release of some source code, like ``lxml-3.4.4.zip``. Source
> distributions serve many purposes: they form an archival record of
> releases, they provide a stupid-simple de facto standard for tools
> that want to ingest and process large corpora of code, possibly
> written in many languages (e.g. code search), they act as the input to
> downstream packaging systems like Debian/Fedora/Conda/..., and so
> forth. In the Python ecosystem they additionally have a particularly
> important role to play, because packaging tools like ``pip`` are able
> to use source distributions to fulfill binary dependencies, e.g. if
> there is a distribution ``foo.whl`` which declares a dependency on
> ``bar``, then we need to support the case where ``pip install bar`` or
> ``pip install foo`` automatically locates the sdist for ``bar``,
> downloads it, builds it, and installs the resulting package.
This is somewhat misleading, given that you go on to specify the
format below, but maybe that's only an issue for someone like me who
saw the previous debate over "source distribution" (as a bundled up
source tree) vs "sdist" as a specified format. If I understand, you've
now discarded the former sense of source distribution, and are
sticking with the latter (specified format) definition.
> Source distributions are also known as "sdists" for short.
>
>
> Source trees
> ============
>
> We retroactively declare the legacy source tree format involving
> ``setup.py`` to be "version 0". We don't try to specify it further;
> its de facto specification is encoded in the source code and
> documentation of ``distutils``, ``setuptools``, ``pip``, and other
> tools.
>
> A "version 1" (or greater) source tree is any directory which contains
> a file named ``pypackage.cfg``, which will -- in some manner whose
> details are TBD -- describe the package's build dependencies and how
> to invoke the build system. This mechanism:
>
> - Will allow for both static and dynamic specification of build dependencies
>
> - Will have some degree of isolation of different builds from each
> other, so that it will be possible for a single run of pip to install
> one package that build-depends on ``foo = 1.1`` and another package
> that build-depends on ``foo = 1.2``.
All good so far.
> - Will leave the actual installation of the package in the hands of
> the build/installation tool (i.e. individual package build systems
> will not need to know about things like --user versus --global or make
> decisions about when and how to modify .pth files)
This seems completely backwards to me. It's pip's job to do the actual
install. The build tool should focus *only* on generating
standards-conforming binary wheels - otherwise what's the point of the
separation of concerns that wheels provide?
Or maybe I'm confused by the term "build/installation tool" - by that
did you actually mean pip, rather than the build system?
(TBDs omitted)
> Source distributions
> ====================
>
> [possibly this should get split off into a separate PEP, but I'll keep
> it together for now for ease of discussion]
>
> A "version 1" (or greater) source distribution is a file meeting the
> following criteria:
>
> - It MUST have a name of the form: {PACKAGE}-{VERSION}.{EXT}, where
> {PACKAGE} is the package name, {VERSION} is a PEP 440-compliant
> version number, and {EXT} is a compliant archive format.
>
> The set of compliant archive formats is: zip, [TBD]
>
> [QUESTION: should we continue to allow .tar.gz and friends? In
> practice by "allow" I mean something like "accept new-style sdists on
> PyPI in this format". I'm inclined not to -- zip is the most
> universally supported format around, it allows file-based random
> access (unlike tar-based things) which is useful for pulling out
> metadata without decompressing the whole thing, and standardizing on
> one format dodges distracting and pointless discussions about which
> format to use, i.e. it's TOOWTDI-compliant. Of course pip is free to
> continue to support other archive formats when passed explicitly on
> the command line. Any objections?]
+1 on having a single archive format, and zip seems like the best choice.
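The naming rule quoted above can be sketched as a small parser. This is only an illustration: `parse_sdist_filename`, `SdistName` and `ALLOWED_EXTENSIONS` are hypothetical names that do not appear in the draft, zip is the only extension accepted because the rest of the list is still TBD, and the version is assumed to be in normalized PEP 440 form (which contains no hyphen) rather than being fully validated.

```python
from typing import NamedTuple

# Hypothetical: the draft so far lists "zip, [TBD]" as compliant formats.
ALLOWED_EXTENSIONS = ("zip",)


class SdistName(NamedTuple):
    package: str
    version: str
    ext: str


def parse_sdist_filename(filename: str) -> SdistName:
    """Split {PACKAGE}-{VERSION}.{EXT} into its three parts."""
    for ext in ALLOWED_EXTENSIONS:
        suffix = "." + ext
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"unsupported archive format: {filename!r}")

    # Assumption: the version is normalized and so has no hyphen, which
    # means the last hyphen separates the package name from the version.
    package, sep, version = stem.rpartition("-")
    if not sep or not package or not version:
        raise ValueError(f"not of the form PACKAGE-VERSION.EXT: {filename!r}")
    return SdistName(package, version, ext)


print(parse_sdist_filename("lxml-3.4.4.zip"))
```

Splitting on the last hyphen keeps package names such as `foo-bar` intact, at the cost of rejecting non-normalized version spellings.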
> Similar to wheels, the archive is Unicode, and the filenames inside
> the archive are encoded in UTF-8.
This isn't the job of the sdist format to specify. It should be
implicit in the choice of archive format.
Having said that, I'd go with
1. The sdist filename MUST support the full range of package names as
specified in PEP 426 (https://www.python.org/dev/peps/pep-0426/#name)
and versions as in PEP 440
(https://www.python.org/dev/peps/pep-0440/). That's actually far less
than full Unicode.
2. The archive format MUST support arbitrary Unicode filenames. That
means zip is OK, but tar.gz isn't unless you specify UTF-8 is used
(the tar format doesn't allow for an encoding declaration - see
https://docs.python.org/3.5/library/tarfile.html#tar-unicode for
details on Unicode issues in the tar format).
Beyond that, I'd also go with "filenames in the archive SHOULD be
limited to ASCII" - because we have had issues with pip where test
files have Unicode filenames, and builds break because they get
mangled on systems with weird encoding setups... IIRC, these are
typically related to .tar.gz sdists, which (due to the lack of
encoding support) result in files being unpacked with the wrong names.
So maybe if we enforce zip format we don't need to add this
limitation.
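If the "SHOULD be limited to ASCII" recommendation were adopted, a tool could check it mechanically. The following is a minimal sketch using the standard `zipfile` module; the helper name `non_ascii_members` and the sample file names are made up for illustration and are not part of the proposal.

```python
import io
import zipfile
from typing import List


def non_ascii_members(archive: zipfile.ZipFile) -> List[str]:
    """Return the archive member names that contain non-ASCII characters."""
    return [name for name in archive.namelist() if not name.isascii()]


# Build a tiny in-memory archive to demonstrate the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg-1.0/setup.py", "")
    zf.writestr("pkg-1.0/tests/donn\u00e9es.txt", "")

with zipfile.ZipFile(buf) as zf:
    for name in non_ascii_members(zf):
        print("warning: non-ASCII filename in archive:", name)
```

A tool would presumably only warn here rather than fail, since the suggestion is a SHOULD and not a MUST.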
> - When unpacked, it MUST contain a single directory tree
> named ``{PACKAGE}-{VERSION}``.
>
> - This directory tree MUST be a valid version 1 (or greater) source
> tree as defined above.
>
> - It MUST additionally contain a directory named
> ``{PACKAGE}-{VERSION}.sdist-info`` (notice the ``s``), with the
> following contents:
>
> - ``SDIST``: Mandatory. Same record-oriented format as a wheel's
> ``WHEEL`` file, but with different fields::
>
> SDist-Version: 1.0
> Generator: setuptools sdist 20.1
>
> ``SDist-Version`` is the version number of this specification.
> Software that processes sdists should warn if ``SDist-Version`` is
> greater than the version it supports, and must fail if
> ``SDist-Version`` has a greater major version than the version it
> supports.
>
> ``Generator`` is the name and optionally the version of the
> software that produced the archive.
>
> - ``RECORD``: Mandatory. A list of all files contained in the sdist
> (except for the RECORD file itself and any signature files) together
> with their hashes, as specified in PEP 427.
>
> - ``RECORD.jws``, ``RECORD.p7s``: Optional. Signature files as
> specified in PEP 427.
>
> - ``METADATA``: Mandatory. Metadata version 1.1 or greater format
> metadata, with an additional rule that fields may contain the special
> sentinel value ``__SDIST_DYNAMIC__``, which indicates that the value
> of this field cannot be determined until build time. If a "multiple
> use field" is present with the value ``__SDIST_DYNAMIC__``, then this
> field MUST occur exactly once, e.g.::
>
> # Okay:
> Requires-Dist: lxml (> 3.3)
> Requires-Dist: requests
>
> # no Requires-Dist lines at all is okay
> # (meaning: this package's requirements are the empty set)
>
> # Okay, requirements will be determined at build time:
> Requires-Dist: __SDIST_DYNAMIC__
>
> # NOT okay:
> Requires-Dist: lxml (> 3.3)
> Requires-Dist: __SDIST_DYNAMIC__
>
> (The use of a special token allows us to distinguish between
> multiple use fields whose value is statically the empty list versus
> one whose value is dynamic; it also allows us to distinguish between
> optional fields which are statically not present versus ones whose
> value is dynamic.)
>
> When this sdist is built, the resulting wheel MUST have metadata
> which is identical to the metadata present in this file, except that
> any fields with value ``__SDIST_DYNAMIC__`` in the sdist may have
> arbitrary values in the wheel.
>
> A valid sdist MUST NOT use the ``__SDIST_DYNAMIC__`` mechanism for
> the package name or version (i.e., these must be given statically),
> and these MUST match the {PACKAGE} and {VERSION} of the sdist as
> described above.
This seems pretty good at first reading.
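To make the rules quoted above concrete, here is a rough sketch of how a consumer might check the ``SDist-Version`` rule and the ``__SDIST_DYNAMIC__`` constraints. It is not taken from any existing tool: the function names are hypothetical, the version is assumed to have the form "major.minor", and METADATA is assumed to parse as RFC 822 style headers via the standard `email` module.

```python
from collections import defaultdict
from email import message_from_string
from typing import Dict, List

SDIST_DYNAMIC = "__SDIST_DYNAMIC__"
SUPPORTED_SDIST_VERSION = (1, 0)      # hypothetical: what this tool supports
STATIC_ONLY = {"name", "version"}     # fields the draft forbids being dynamic


def check_sdist_version(declared: str) -> str:
    """Return "ok", "warn" or "fail" for a declared SDist-Version."""
    major, minor = (int(part) for part in declared.split("."))
    if major > SUPPORTED_SDIST_VERSION[0]:
        return "fail"   # greater major version: must fail
    if (major, minor) > SUPPORTED_SDIST_VERSION:
        return "warn"   # newer minor version: should warn
    return "ok"


def check_dynamic_fields(metadata_text: str) -> List[str]:
    """Return a list of rule violations (empty if none were found)."""
    values: Dict[str, List[str]] = defaultdict(list)
    for field, value in message_from_string(metadata_text).items():
        values[field.lower()].append(value.strip())

    problems = []
    for field, vals in values.items():
        if SDIST_DYNAMIC not in vals:
            continue
        if field in STATIC_ONLY:
            problems.append(f"{field}: must be given statically")
        elif len(vals) > 1:
            problems.append(f"{field}: dynamic marker must occur exactly once")
    return problems


bad = (
    "Metadata-Version: 1.1\n"
    "Name: foo\n"
    "Version: 1.0\n"
    "Requires-Dist: lxml (> 3.3)\n"
    "Requires-Dist: __SDIST_DYNAMIC__\n"
)
print(check_sdist_version("1.1"), check_dynamic_fields(bad))
```

The check that name and version match the sdist filename is left out here; it would compare these two fields against the parsed {PACKAGE} and {VERSION}.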
> [TBD: do we want to forbid the use of dynamic metadata for any
> other fields? I assume PyPI will enforce some stricter rules at least,
> but I don't know if we want to make that part of the spec, or just
> part of PyPI's administrative rules.]
This covers the main point of contention. It would be bad if build
systems started using __SDIST_DYNAMIC__ just because "it's easier".
Maybe add
* A valid sdist SHOULD NOT use the __SDIST_DYNAMIC__ mechanism any
more than necessary (i.e., if the metadata is the same in all
generated wheels, it does not need to use the __SDIST_DYNAMIC__
mechanism, and so should not do so).
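One way a build tool or index might spot overuse of the mechanism is to compare the wheels built from an sdist. The sketch below is purely illustrative: `needlessly_dynamic` is a hypothetical helper, metadata is represented as a plain mapping from field name to list of values, and the result only means something when the wheels come from at least two different build environments.

```python
from typing import Dict, List

Metadata = Dict[str, List[str]]
SDIST_DYNAMIC = "__SDIST_DYNAMIC__"


def needlessly_dynamic(sdist: Metadata, wheels: List[Metadata]) -> List[str]:
    """List fields marked dynamic in the sdist whose value is identical
    in every wheel that was built from it."""
    flagged = []
    for field, values in sdist.items():
        if values != [SDIST_DYNAMIC]:
            continue  # statically declared, nothing to check
        built = [sorted(wheel.get(field, [])) for wheel in wheels]
        # With fewer than two wheels there is nothing to compare against.
        if len(built) > 1 and all(b == built[0] for b in built):
            flagged.append(field)
    return flagged


sdist = {
    "Name": ["foo"],
    "Requires-Dist": [SDIST_DYNAMIC],
    "Requires-Python": [SDIST_DYNAMIC],
}
wheels = [
    {"Name": ["foo"], "Requires-Dist": ["a"], "Requires-Python": [">=3"]},
    {"Name": ["foo"], "Requires-Dist": ["a", "b"], "Requires-Python": [">=3"]},
]
print(needlessly_dynamic(sdist, wheels))
```

A field flagged this way is only a candidate for being declared statically; identical values across the sampled wheels do not prove the value is the same everywhere.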
> This is intentionally a close analogue of a wheel's ``.dist-info``
> directory; the intention is that as future metadata standards are defined,
> the specifications for the ``.sdist-info`` and ``.dist-info``
> directories will evolve in synchrony.
>
>
> Evolutionary notes
> ==================
>
> A goal here is to make it as simple as possible to convert old-style
> sdists to new-style sdists. (E.g., this is one motivation for
> supporting dynamic build requirements.) The ideal would be that there
> would be a single static pypackage.cfg that could be dropped into any
> "version 0" VCS checkout to convert it to the new shiny. This is
> probably not 100% possible, but we can get close, and it's important
> to keep track of how close we are... hence this section.
>
> A rough plan would be: Create a build system package
> (``setuptools_pypackage`` or whatever) that knows how to speak
> whatever hook language we come up with, and converts the hooks into
> setuptools calls. This will probably require some sort of hooking or
> monkeypatching to setuptools to provide a way to extract the
> ``setup_requires=`` argument when needed, and to provide a new version
> of the sdist command that generates the new-style format. This all
> seems doable and sufficient for a large proportion of packages (though
> obviously we'll want to prototype such a system before we finalize
> anything here). (Alternatively, these changes could be made to
> setuptools itself rather than going into a separate package.)
>
> But there remain two obstacles that mean we probably won't be able to
> automatically upgrade packages to the new format:
>
> 1) There currently exist packages which insist on particular packages
> being available in their environment before setup.py is executed. This
> means that if we decide to execute build scripts in an isolated
> virtualenv-like environment, then projects will need to check whether
> they do this, and if so then when upgrading to the new system they
> will have to start explicitly declaring these dependencies (either via
> ``setup_requires=`` or via static declaration in ``pypackage.cfg``).
>
> 2) There currently exist packages which do not declare consistent
> metadata (e.g. ``egg_info`` and ``bdist_wheel`` might get different
> ``install_requires=``). When upgrading to the new system, projects
> will have to evaluate whether this applies to them, and if so they
> will need to either stop doing that, or else add ``__SDIST_DYNAMIC__``
> annotations at appropriate places.
>
> We'll also presumably need some API for packages to describe which
> parts of the METADATA file should be marked ``__SDIST_DYNAMIC__``, for
> the packages that need it (a new argument to ``setup()`` or some
> setting in ``setup.cfg`` or something).
I'm confused here. And it's just now become clear *why* I'm confused.
The sdist format MUST be a generated format - i.e., we should insist
(in principle at least) that it's only ever generated by tools.
Otherwise it's way too easy for people to just zip up their source
tree, hand craft something generic (that over-uses __SDIST_DYNAMIC__)
and say "here's an sdist". Obviously, people always *can* manually
create an sdist, but we need to pin down the spec tightly, or we've not
improved things.
That's why I'm concerned about __SDIST_DYNAMIC__ and it's also what
confuses me about the above transition plan.
For people using setuptools currently, the transition should be simply
that they upgrade setuptools, and the "setup.py sdist" command in the
new setuptools generates the new sdist format. By default, the
setuptools sdist process assumes everything is static and requires the
user to modify the setup.py to explicitly mark which metadata they
want to be left to build time. That way, we get a relatively
transparent transition, while avoiding overuse of dynamic metadata.
If setup.py has to explicitly mark dynamic metadata, that also allows
us to reject attempts to make name and version dynamic. Which is good.
Paul