[Distutils] Second draft of a plan for a new source tree / sdist format

David Cournapeau cournape at gmail.com
Tue Oct 27 10:00:25 EDT 2015

On Tue, Oct 27, 2015 at 1:12 PM, Daniel Holth <dholth at gmail.com> wrote:

> The drawback of .zip is file size, since it compresses each file
> individually rather than giving the compression algorithm a larger input;
> it's a great format otherwise. It is ubiquitous, including Apple iOS packages,
> Java, and word processor file formats. And most Python packages are small.

I don't really buy the indexing advantages, especially w/ the current
implementation of zipfile in python (e.g. loading the whole set of archives
at creation time)

A common way to get fast metadata access from an archive is to archive
the metadata and the data separately (e.g. a zipfile containing 2
zipfiles, one being the original sdist, the other one containing the
metadata).

> We must do the hard work to support Unicode file names, and spaces and
> accent marks in home directory names (historically a problem on Windows),
> in our packaging system. It is the right thing to do. It is not the
> publisher's fault that your system has broken Unicode.
> On Tue, Oct 27, 2015 at 6:43 AM Paul Moore <p.f.moore at gmail.com> wrote:
>> On 26 October 2015 at 06:04, Nathaniel Smith <njs at pobox.com> wrote:
>> > Here's a second round of text towards making a build-system
>> > independent interface between pip and source trees/sdists. My idea
>> > this time is to take a divide-and-conquer approach: this text tries to
>> > summarize all the stuff that it seemed like we had mostly reached
>> > consensus on in the previous thread + call, with blank chunks marked
>> > "TBD" where there are specific points that still need To Be
>> > Determined. So my hope is that everyone will read what's here and
>> > agree that it's great as far as it goes, and then we can go through
>> > and fill in each missing piece one at a time.
>> I'll comment on what's here, but ignore the TBD items - I'd rather (as
>> you suggest) leave discussion of those details till the basic idea is
>> agreed.
>> > Abstract
>> > ========
>> >
>> > Distutils delenda est.
>> While this makes a nice tagline, I'd rather something less negative.
>> Distutils does not "need" to be destroyed. It's perfectly adequate
>> (although hardly user friendly) for a lot of cases - I'd be willing to
>> suggest *most* users can work just fine with distutils.
>> I'm not a fan of distutils, but I'd prefer it if we kept the rhetoric
>> limited - as Nick pointed out this whole area is as much a political
>> issue as a technical one.
>> > Extended abstract
>> > =================
>> >
>> > While ``distutils`` / ``setuptools`` have taken us a long way, they
>> > suffer from three serious problems: (a) they're missing important
>> > features like autoconfiguration and usable build-time dependency
>> > declaration, (b) extending them is quirky, complicated, and fragile,
>> > (c) it's very difficult to use anything else, because they provide the
>> > standard interface for installing python packages expected by both
>> > users and installation tools like ``pip``.
>> Again, this is overstated. You very nearly lost me right here - people
>> won't read the details of the proposal if they disagree with the
>> abstract(s). Specifically:
>> * The features in (a) are only important to *some* parts of the
>> community. The scientific community is the major one, and is a huge
>> influence over the direction we want to go in, but again, not crucial
>> to many people. And even where they might be useful (e.g., Windows
>> users building pyyaml, lxml, pillow, ...) the description implies
>> "working out what's there" rather than "allowing users to easily
>> manage non-Python dependencies", which gives the wrong impression.
>> * The features in (b) are highly specialised. Very few people extend
>> setuptools/distutils. And those who do, have often invested a lot of
>> effort in doing so. Sure, they'd rather not have needed to, but now
>> that they have, a replacement system simply means that work is lost.
>> Arguably, fixing (b) is only useful for people (like the scientific
>> community) who have needed to extend setuptools and have been unable
>> to achieve their goals that way. That's an even smaller part of the
>> community.
>> > Previous efforts (e.g. distutils2 or setuptools itself) have attempted
>> > to solve problems (a) and/or (b). We propose to solve (c).
>> Agreed - this is a good approach. But it's at odds with your abstract,
>> which says distutils must die. Here you're saying you want to allow
>> people to keep using distutils but allow people with specialised needs
>> to choose an alternative. Or are you offering an alternative to people
>> who use distutils?
>> The whole of the above is confusing on the face of it. The details
>> below clarify a lot, as does knowing how the previous discussions have
>> gone. But it would help a lot if the introduction to this PEP were
>> clearer.
>> > The goal of this PEP is to get distutils-sig out of the business of being
>> > a gatekeeper for Python build systems. If you want to use distutils,
>> > great; if you want to use something else, then that should be easy to
>> > do using standardized methods. The difficulty of interfacing with
>> > distutils means that there aren't many such systems right now, but to
>> > give a sense of what we're thinking about see `flit
>> > <https://github.com/takluyver/flit>`_ or `bento
>> > <https://cournape.github.io/Bento/>`_. Fortunately, wheels have now
>> > solved many of the hard problems here -- e.g. it's no longer necessary
>> > that a build system also know about every possible installation
>> > configuration -- so pretty much all we really need from a build system
>> > is that it have some way to spit out standard-compliant wheels.
>> OK. Although I see a risk here that if I want to build package FOO, I
>> now have to worry whether FOO's build system supports Windows, as well
>> as worrying whether FOO itself supports Windows.
>> There's still a role for some "gatekeeper" (not a good word IMO, maybe
>> "coordinator") to provide a certain level of support or review of
>> build systems, and a point of contact for users with build issues (the
>> point of this proposal is to some extent that people don't need to
>> *know* what build system a project uses, so suggesting everyone has to
>> direct issues to the correct build system support forum isn't
>> necessarily practical).
>> > We therefore propose a new, relatively minimal interface for
>> > installation tools like ``pip`` to interact with package source trees
>> > and source distributions.
>> >
>> > In addition, we propose a wheel-inspired static metadata format for
>> > sdists, suitable for tools like PyPI and pip's resolver.
>> >
>> >
>> > Terminology and goals
>> > =====================
>> >
>> > A *source tree* is something like a VCS checkout. We need a standard
>> > interface for installing from this format, to support usages like
>> > ``pip install some-directory/``.
>> >
>> > A *source distribution* is a static snapshot representing a particular
>> > release of some source code, like ``lxml-3.4.4.zip``. Source
>> > distributions serve many purposes: they form an archival record of
>> > releases, they provide a stupid-simple de facto standard for tools
>> > that want to ingest and process large corpora of code, possibly
>> > written in many languages (e.g. code search), they act as the input to
>> > downstream packaging systems like Debian/Fedora/Conda/..., and so
>> > forth. In the Python ecosystem they additionally have a particularly
>> > important role to play, because packaging tools like ``pip`` are able
>> > to use source distributions to fulfill binary dependencies, e.g. if
>> > there is a distribution ``foo.whl`` which declares a dependency on
>> > ``bar``, then we need to support the case where ``pip install bar`` or
>> > ``pip install foo`` automatically locates the sdist for ``bar``,
>> > downloads it, builds it, and installs the resulting package.
>> This is somewhat misleading, given that you go on to specify the
>> format below, but maybe that's only an issue for someone like me who
>> saw the previous debate over "source distribution" (as a bundled up
>> source tree) vs "sdist" as a specified format. If I understand, you've
>> now discarded the former sense of source distribution, and are
>> sticking with the latter (specified format) definition.
>> > Source distributions are also known as "sdists" for short.
>> >
>> >
>> > Source trees
>> > ============
>> >
>> > We retroactively declare the legacy source tree format involving
>> > ``setup.py`` to be "version 0". We don't try to specify it further;
>> > its de facto specification is encoded in the source code and
>> > documentation of ``distutils``, ``setuptools``, ``pip``, and other
>> > tools.
>> >
>> > A "version 1" (or greater) source tree is any directory which contains
>> > a file named ``pypackage.cfg``, which will -- in some manner whose
>> > details are TBD -- describe the package's build dependencies and how
>> > to invoke the build system. This mechanism:
>> >
>> > - Will allow for both static and dynamic specification of build
>> > dependencies
>> >
>> > - Will have some degree of isolation of different builds from each
>> > other, so that it will be possible for a single run of pip to install
>> > one package that build-depends on ``foo = 1.1`` and another package
>> > that build-depends on ``foo = 1.2``.
>> All good so far.
>> > - Will leave the actual installation of the package in the hands of
>> > the build/installation tool (i.e. individual package build systems
>> > will not need to know about things like --user versus --global or make
>> > decisions about when and how to modify .pth files)
>> This seems completely backwards to me. It's pip's job to do the actual
>> install. The build tool should *only* focus on generating standard
>> conforming binary wheels - otherwise what's the point of the
>> separation of concerns that wheels provide?
>> Or maybe I'm confused by the term "build/installation tool" - by that
>> did you actually mean pip, rather than the build system?
>> (TBDs omitted)
>> > Source distributions
>> > ====================
>> >
>> > [possibly this should get split off into a separate PEP, but I'll keep
>> > it together for now for ease of discussion]
>> >
>> > A "version 1" (or greater) source distribution is a file meeting the
>> > following criteria:
>> >
>> > - It MUST have a name of the form: {PACKAGE}-{VERSION}.{EXT}, where
>> > {PACKAGE} is the package name, {VERSION} is a PEP 440-compliant
>> > version number, and {EXT} is a compliant archive format.
>> >
>> >   The set of compliant archive formats is: zip, [TBD]
>> >
>> >   [QUESTION: should we continue to allow .tar.gz and friends? In
>> > practice by "allow" I mean something like "accept new-style sdists on
>> > PyPI in this format". I'm inclined not to -- zip is the most
>> > universally supported format around, it allows file-based random
>> > access (unlike tar-based things) which is useful for pulling out
>> > metadata without decompressing the whole thing, and standardizing on
>> > one format dodges distracting and pointless discussions about which
>> > format to use, i.e. it's TOOWTDI-compliant. Of course pip is free to
>> > continue to support other archive formats when passed explicitly on
>> > the command line. Any objections?]
>> +1 on having a single archive format, and zip seems like the best choice.
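
To illustrate the random-access point in the quoted question, here is a
minimal sketch using the standard ``zipfile`` module. The helper name
``read_member`` is hypothetical, and the member path used in the demo is
only an assumed layout; the draft does not pin down where the metadata
file sits inside the archive.

```python
import io
import zipfile


def read_member(archive, member):
    """Return the bytes of one member of a zip archive.

    ZipFile reads the central directory when the archive is opened, then
    seeks directly to the requested member; the other members are not
    decompressed.
    """
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)


# Demo on an in-memory archive with one small and one large member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("foo-1.0/foo-1.0.sdist-info/METADATA", "Name: foo\n")
    zf.writestr("foo-1.0/big.dat", "x" * 100000)
metadata = read_member(buf, "foo-1.0/foo-1.0.sdist-info/METADATA")
```

Note that the central directory still has to be read in full on open,
which is the cost mentioned at the top of this message.
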
>> >   Similar to wheels, the archive is Unicode, and the filenames inside
>> > the archive are encoded in UTF-8.
>> This isn't the job of the sdist format to specify. It should be
>> implicit in the choice of archive format.
>> Having said that, I'd go with
>> 1. The sdist filename MUST support the full range of package names as
>> specified in PEP 426 (https://www.python.org/dev/peps/pep-0426/#name)
>> and versions as in PEP 440
>> (https://www.python.org/dev/peps/pep-0440/). That's actually far less
>> than full Unicode.
>> 2. The archive format MUST support arbitrary Unicode filenames. That
>> means zip is OK, but tar.gz isn't unless you specify UTF-8 is used
>> (the tar format doesn't allow for an encoding declaration - see
>> https://docs.python.org/3.5/library/tarfile.html#tar-unicode for
>> details on Unicode issues in the tar format).
>> Having said that I'd also go with "filenames in the archive SHOULD be
>> limited to ASCII" - because we have had issues with pip where test
>> files have Unicode filenames, and builds break because they get
>> mangled on systems with weird encoding setups... IIRC, these are
>> typically related to .tar.gz sdists, which (due to the lack of
>> encoding support) result in files being unpacked with the wrong names.
>> So maybe if we enforce zip format we don't need to add this
>> limitation.
>> > - When unpacked, it MUST contain a single directory tree
>> > named ``{PACKAGE}-{VERSION}``.
>> >
>> > - This directory tree MUST be a valid version 1 (or greater) source
>> > tree as defined above.
>> >
>> > - It MUST additionally contain a directory named
>> > ``{PACKAGE}-{VERSION}.sdist-info`` (notice the ``s``), with the
>> > following contents:
>> >
>> >   - ``SDIST``: Mandatory. Same record-oriented format as a wheel's
>> > ``WHEEL`` file, but with different fields::
>> >
>> >       SDist-Version: 1.0
>> >       Generator: setuptools sdist 20.1
>> >
>> >     ``SDist-Version`` is the version number of this specification.
>> > Software that processes sdists should warn if ``SDist-Version`` is
>> > greater than the version it supports, and must fail if
>> > ``SDist-Version`` has a greater major version than the version it
>> > supports.
>> >
>> >     ``Generator`` is the name and optionally the version of the
>> > software that produced the archive.
>> >
>> >   - ``RECORD``: Mandatory. A list of all files contained in the sdist
>> > (except for the RECORD file itself and any signature files) together
>> > with their hashes, as specified in PEP 427.
>> >
>> >   - ``RECORD.jws``, ``RECORD.p7s``: Optional. Signature files as
>> > specified in PEP 427.
>> >
>> >   - ``METADATA``: Mandatory. Metadata version 1.1 or greater format
>> > metadata, with an additional rule that fields may contain the special
>> > sentinel value ``__SDIST_DYNAMIC__``, which indicates that the value
>> > of this field cannot be determined until build time. If a "multiple
>> > use field" is present with the value ``__SDIST_DYNAMIC__``, then this
>> > field MUST occur exactly once, e.g.::
>> >
>> >        # Okay:
>> >        Requires-Dist: lxml (> 3.3)
>> >        Requires-Dist: requests
>> >
>> >        # no Requires-Dist lines at all is okay
>> >        # (meaning: this package's requirements are the empty set)
>> >
>> >        # Okay, requirements will be determined at build time:
>> >        Requires-Dist: __SDIST_DYNAMIC__
>> >
>> >        # NOT okay:
>> >        Requires-Dist: lxml (> 3.3)
>> >        Requires-Dist: __SDIST_DYNAMIC__
>> >
>> >     (The use of a special token allows us to distinguish between
>> > multiple use fields whose value is statically the empty list versus
>> > one whose value is dynamic; it also allows us to distinguish between
>> > optional fields which are statically not present versus ones whose
>> > value is dynamic.)
>> >
>> >     When this sdist is built, the resulting wheel MUST have metadata
>> > which is identical to the metadata present in this file, except that
>> > any fields with value ``__SDIST_DYNAMIC__`` in the sdist may have
>> > arbitrary values in the wheel.
>> >
>> >     A valid sdist MUST NOT use the ``__SDIST_DYNAMIC__`` mechanism for
>> > the package name or version (i.e., these must be given statically),
>> > and these MUST match the {PACKAGE} and {VERSION} of the sdist as
>> > described above.
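
The ``__SDIST_DYNAMIC__`` rules quoted above could be checked roughly
like this. It is a sketch under assumptions: the function name
``check_dynamic_fields`` is invented for illustration, and it only looks
at the two rules stated in the draft (a field using the sentinel occurs
exactly once; name and version are static). It parses the file with the
standard ``email`` header parser, since the metadata format is a list of
``Key: value`` headers.

```python
from email.parser import HeaderParser

DYNAMIC = "__SDIST_DYNAMIC__"


def check_dynamic_fields(metadata_text):
    """Return a list of problems found; an empty list means the rules hold."""
    msg = HeaderParser().parsestr(metadata_text)
    problems = []

    # Name and Version must be given statically.
    for field in ("Name", "Version"):
        if msg.get(field, "").strip() == DYNAMIC:
            problems.append(f"{field} must not be dynamic")

    # A field that uses the sentinel must occur exactly once.
    for field in sorted(set(msg.keys())):
        values = [v.strip() for v in msg.get_all(field)]
        if DYNAMIC in values and len(values) > 1:
            problems.append(f"{field} mixes {DYNAMIC} with other values")
    return problems
```
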
>> This seems pretty good at first reading.
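
The ``SDist-Version`` rule quoted earlier (warn on a newer version, fail
on a newer major version) might look like the following in a consuming
tool. This is a sketch only: ``SUPPORTED`` and ``check_sdist_version``
are hypothetical names, and it assumes the value is a dotted sequence of
integers such as ``1.0``.

```python
import warnings

SUPPORTED = (1, 0)  # hypothetical: the spec version this tool implements


def check_sdist_version(value, supported=SUPPORTED):
    """Apply the warn/fail rule to an SDist-Version value such as "1.0"."""
    version = tuple(int(part) for part in value.strip().split("."))
    if version[0] > supported[0]:
        # Newer major version: the tool must refuse to process the sdist.
        raise ValueError(f"unsupported SDist-Version {value}")
    if version > supported:
        # Same major version but newer overall: warn and carry on.
        warnings.warn(f"SDist-Version {value} is newer than supported")
    return version
```
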
>> >     [TBD: do we want to forbid the use of dynamic metadata for any
>> > other fields? I assume PyPI will enforce some stricter rules at least,
>> > but I don't know if we want to make that part of the spec, or just
>> > part of PyPI's administrative rules.]
>> This covers the main point of contention. It would be bad if build
>> systems started using __SDIST_DYNAMIC__ just because "it's easier".
>> Maybe add
>> * A valid sdist SHOULD NOT use the __SDIST_DYNAMIC__ mechanism any
>> more than necessary (i.e., if the metadata is the same in all
>> generated wheels, it does not need to use the __SDIST_DYNAMIC__
>> mechanism, and so should not do so).
>> > This is intentionally a close analogue of a wheel's ``.dist-info``
>> > directory; the intention is that as future metadata standards are defined,
>> > the specifications for the ``.sdist-info`` and ``.dist-info``
>> > directories will evolve in synchrony.
>> >
>> >
>> > Evolutionary notes
>> > ==================
>> >
>> > A goal here is to make it as simple as possible to convert old-style
>> > sdists to new-style sdists. (E.g., this is one motivation for
>> > supporting dynamic build requirements.) The ideal would be that there
>> > would be a single static pypackage.cfg that could be dropped into any
>> > "version 0" VCS checkout to convert it to the new shiny. This is
>> > probably not 100% possible, but we can get close, and it's important
>> > to keep track of how close we are... hence this section.
>> >
>> > A rough plan would be: Create a build system package
>> > (``setuptools_pypackage`` or whatever) that knows how to speak
>> > whatever hook language we come up with, and convert it into
>> > setuptools calls. This will probably require some sort of hooking or
>> > monkeypatching to setuptools to provide a way to extract the
>> > ``setup_requires=`` argument when needed, and to provide a new version
>> > of the sdist command that generates the new-style format. This all
>> > seems doable and sufficient for a large proportion of packages (though
>> > obviously we'll want to prototype such a system before we finalize
>> > anything here). (Alternatively, these changes could be made to
>> > setuptools itself rather than going into a separate package.)
>> >
>> > But there remain two obstacles that mean we probably won't be able to
>> > automatically upgrade packages to the new format:
>> >
>> > 1) There currently exist packages which insist on particular packages
>> > being available in their environment before setup.py is executed. This
>> > means that if we decide to execute build scripts in an isolated
>> > virtualenv-like environment, then projects will need to check whether
>> > they do this, and if so then when upgrading to the new system they
>> > will have to start explicitly declaring these dependencies (either via
>> > ``setup_requires=`` or via static declaration in ``pypackage.cfg``).
>> >
>> > 2) There currently exist packages which do not declare consistent
>> > metadata (e.g. ``egg_info`` and ``bdist_wheel`` might get different
>> > ``install_requires=``). When upgrading to the new system, projects
>> > will have to evaluate whether this applies to them, and if so they
>> > will need to either stop doing that, or else add ``__SDIST_DYNAMIC__``
>> > annotations at appropriate places.
>> >
>> >    We'll also presumably need some API for packages to describe which
>> > parts of the METADATA file should be marked ``__SDIST_DYNAMIC__``, for
>> > the packages that need it (a new argument to ``setup()`` or some
>> > setting in ``setup.cfg`` or something).
>> I'm confused here. And it's just now become clear *why* I'm confused.
>> The sdist format MUST be a generated format - i.e., we should insist
>> (in principle at least) that it's only ever generated by tools.
>> Otherwise it's way too easy for people to just zip up their source
>> tree, hand craft something generic (that over-uses __SDIST_DYNAMIC__)
>> and say "here's an sdist". Obviously, people always *can* manually
>> create an sdist but we need to pin down the spec tightly, or we've not
>> improved things.
>> That's why I'm concerned about __SDIST_DYNAMIC__ and it's also what
>> confuses me about the above transition plan.
>> For people using setuptools currently, the transition should be simply
>> that they upgrade setuptools, and the "setup.py sdist" command in the
>> new setuptools generates the new sdist format. By default, the
>> setuptools sdist process assumes everything is static and requires the
>> user to modify the setup.py to explicitly mark which metadata they
>> want to be left to build time. That way, we get a relatively
>> transparent transition, while avoiding overuse of dynamic metadata.
>> If setup.py has to explicitly mark dynamic metadata, that also allows
>> us to reject attempts to make name and version dynamic. Which is good.
>> Paul
>> _______________________________________________
>> Distutils-SIG maillist  -  Distutils-SIG at python.org
>> https://mail.python.org/mailman/listinfo/distutils-sig