[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Paul Moore p.f.moore at gmail.com
Mon Oct 5 13:28:48 CEST 2015


OK, I've had a better read of your email now. Responses inline.

On 5 October 2015 at 07:29, Nathaniel Smith <njs at pobox.com> wrote:
> First, let's drop the word "sdist", it's confusing.

We can't (see below for details). We can deprecate the sdist concept,
if that's what you want to propose. From what I gather, you're
proposing deprecating it in favour of a "source wheel" concept. I
don't have a huge issue with that other than that I don't see the
necessity - the sdist concept pretty much covers what you want, except
maybe that it's not clear enough to people outside the packaging
community how it differs from a VCS checkout.

> I'm starting from the long history and conventions around how people make
> what I'll call "source releases" (and in a few paragraphs will contrast with
> "source wheels"). 'Everyone knows' that when you release a new version of
> some package (in pretty much any language), then one key step is to put
> together a file called <package>-<version>.<zip or .tar.gz>. And 'everyone
> knows' that if you see a file that follows this naming convention, and you
> download it, then what you'll find inside is: a single directory called
> <package>-<version>/, and inside this directory will be something that's
> almost like a VCS checkout -- it'll probably contain a README, source files
> in convenient places to be edited or grepped, etc. The differences from a
> VCS checkout (if any) will be little convenience stuff -- like ./autogen.sh
> will have been run already, or there will be an extra file containing a
> fixed version number instead of it being autodetected, or -DNDEBUG will be
> in the default CFLAGS, or Cython files will have been pre-translated to C --
> but fundamentally it will be similar to a VCS checkout, and switching back
> and forth between them won't be too jarring. 95% of the time there will be a
> standard way to build the thing ('./configure && make && make install', or
> 'python setup.py install', or similar).

Breaking at this point, because that's frankly *not* the reality in
the Python packaging world (at least not nowadays - I'm not clear to
what extent you're just talking about history and background here,
although your reference to Cython makes me think you're talking in
terms of current practice). It may look like that, but there are some
fundamental differences.

First and foremost, nobody zips up and publishes their VCS checkout in
the way you describe. (At least not if they are using the standard
tools - distutils and setuptools). Instead, they create a "sdist"
using the "python setup.py sdist" command. I'm sorry, but I'm going to
carry on using the "sdist" term here, because I'm describing current
practice and sdists *are* current practice.

The difference between a sdist and what you call a "source release"
is subtle, precisely because the current sdist format is a bit of a
mess, but the key point is that all sdists are created by a standard
process, and conform to a standard naming convention and layout. The
packaging tools rely on being able to make that assumption, in all
sorts of ways which we're doing our best to clarify as part of this
thread, but which honestly have been a little bit implicit up to this
point.
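
To make that assumption concrete, here is a minimal Python sketch of the
sort of layout check a tool could apply. The function name is made up for
illustration, it only handles .tar.gz, and it is not a description of what
pip actually does:

```python
import io
import tarfile


def has_conventional_layout(filename, archive_bytes):
    """True if <name>-<version>.tar.gz unpacks to a single <name>-<version>/ dir."""
    if not filename.endswith(".tar.gz"):
        return False
    expected_root = filename[:-len(".tar.gz")]
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        # Collect the first path component of every member in the archive.
        roots = {m.name.split("/")[0] for m in tar.getmembers()}
    # Exactly one top-level entry, and it must match the file name.
    return roots == {expected_root}
```

An archive whose files sit at the top level, or under a differently named
directory, fails the check, which is the kind of thing that breaks tools
relying on the convention.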

Further muddying the water is the fact that as you say, pip needs to
be able to build from a VCS checkout (a directory on the user's local
system) and we have code in pip that does that - mostly by assuming
that you can treat a VCS checkout as an unpacked sdist, but there are
hacks we need to do to make that work (we run setup.py egg_info to get
the metadata we need, for example, which has implications as we only
get that data at a later stage than we have it in the sdist case) and
differences in functionality (develop mode).

At this point I'm not saying that things have to be this way, or even
that "make a source release however you choose as long as it follows
these conventions" isn't a viable way forward, but I do think we need
to agree on our picture of how things are now, or we'll continue
talking past each other.

> And these kind of source releases
> have a rich ecosystem around them and serve a wide range of uses: they
> provide a low-tech archival record (while VCS's come and go), they end up in
> deb and rpm "original source" bundles, they get downloaded by users and
> built by hand (maybe with weird configury on top, like a hack to enable
> cross-compilation) or poked around in by hand, etc. etc. When sdists were
> originally designed, then "source releases" is what the designers were
> thinking about.

This, on the other hand, I suspect is not far from the truth. When
sdists were designed, they were a convenience for bundling the stuff
needed to do setup.py install later, possibly on a different machine.

But that's a long time ago, and not really relevant now. For better or
worse. Unless you are suggesting that we go all the way back to that
original point? Which you may be, but that means discarding the work
that's been done based on the sdist concept since then. Which leads
nicely on to...

> Then, easy_install came along, and pulled off a clever hack where when you
> asked for some package by name, then it would try to automagically go out
> and track down any relevant source releases and build them all. And it works
> great, except when it doesn't. And creates massive headaches for everyone
> trying to work on python packaging afterwards, because source releases were
> not designed to be used this way.
>
> My hypothesis is that the requirements that were confusing me are based
> around the idea that an sdist should be something designed to slot into this
> particular use case: i.e., something that pip can automatically grab and
> work with while solving a dependency resolution problem. Therefore it really
> needs to have a static name, and static version number, and static
> dependencies, and must produce exactly one binary wheel that shares all that
> metadata.

Anyone who knows my history will know that I'm the last person to
defend setuptools' hacks, but you hit the nail on the head above. When
it works, it works great (meh, "sufficiently well" :-))

And pip *needs* to do static dependency resolution. We have enough bug
reports and feature requests asking that we improve the dependency
resolution process that nobody is going to be happy with anything that
doesn't allow for at least as much static information as we currently
have, and ultimately more.

> Let's call this a "source wheel" -- what we're really looking for
> here is a way to ship extension modules inside something that acts like an
> architecture-neutral "none-any" wheel.

I don't understand this statement. What do extension modules matter
here? We need to be able to ship sources in a form that can
participate in dependency resolution (and any other potential
discovery processes that may turn up in future) without having to run
code to do so. The reasons for this are:

1. Running code is a performance overhead, and possibly even a
security risk (even trusted code may behave in ways you didn't
anticipate). We want to do as little as possible of that as we can,
and in particular we want to discard invalid candidate files without
running any of their code.
2. Running code introduces the possibility of that code failing. We
don't want end users to have installs fail because code in
distributions we're going to discard is buggy.
3. Repositories like PyPI need to present project metadata, for both
human and tool consumption - they can only do this if it's available
statically.
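
As an illustration of reasons 1 and 2, metadata that lives in a static
file can be read without executing anything from the archive. This is only
a sketch: it assumes a hypothetical archive whose <package>-<version>/PKG-INFO
carries Requires-Dist fields (current sdists don't reliably provide that),
and the function names are mine:

```python
import io
import tarfile
from email.parser import Parser


def read_static_metadata(archive_bytes):
    """Return (name, version, requires) from PKG-INFO, running no project code."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            # Only consider <package>-<version>/PKG-INFO at the top level.
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                text = tar.extractfile(member).read().decode("utf-8")
                msg = Parser().parsestr(text)  # PKG-INFO uses RFC 822 style headers
                return msg["Name"], msg["Version"], msg.get_all("Requires-Dist") or []
    raise ValueError("no PKG-INFO found; metadata is not available statically")


def make_example_archive():
    """Build a tiny, hypothetical sdist-like archive in memory for the demo."""
    pkg_info = (
        "Metadata-Version: 1.2\n"
        "Name: example\n"
        "Version: 1.0\n"
        "Requires-Dist: requests (>=2.0)\n"
    ).encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("example-1.0/PKG-INFO")
        info.size = len(pkg_info)
        tar.addfile(info, io.BytesIO(pkg_info))
    return buf.getvalue()


name, version, requires = read_static_metadata(make_example_archive())
```

A candidate that turns out to have the wrong version can then be discarded
without its setup.py ever being run, so a bug in that setup.py cannot break
the install.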

You seem to be thinking that binary wheels are sufficient for this for
pure-Python code. Look at it the other way - we discard sdists from
the dependency calculations whenever there's an equivalent binary
wheel available. That's always for none-any wheels, but less often for
architecture-dependent wheels. But in order to know that the wheel is
equivalent, we need to match it with the sdist - so the sdist needs
the metadata you're trying to argue against providing...
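
A rough Python sketch of that matching step follows. It takes the name and
version from file names purely for illustration, treats only none-any
wheels as always usable, and leaves out the platform compatibility check
that architecture-dependent wheels would need; the function names are
hypothetical:

```python
def parse_candidate(filename):
    """Return (kind, name, version) taken from a wheel or sdist filename."""
    if filename.endswith(".whl"):
        # Wheel names start with <name>-<version>-...; "-" in the project
        # name is written as "_" in wheel filenames.
        name, version = filename[:-len(".whl")].split("-")[:2]
        return "wheel", name.lower(), version
    for ext in (".tar.gz", ".zip"):
        if filename.endswith(ext):
            # Assumption: the version has no "-", so split on the last one.
            name, version = filename[:-len(ext)].rsplit("-", 1)
            return "sdist", name.lower().replace("-", "_"), version
    raise ValueError("unrecognised file: %s" % filename)


def drop_redundant_sdists(filenames):
    """Drop each sdist that has a none-any wheel with the same name and version."""
    parsed = [(f,) + parse_candidate(f) for f in filenames]
    pure_wheels = {
        (name, version)
        for f, kind, name, version in parsed
        if kind == "wheel" and f.endswith("-none-any.whl")
    }
    return [
        f for f, kind, name, version in parsed
        if kind == "wheel" or (name, version) not in pure_wheels
    ]
```

The comparison only works because both files expose the same name and
version up front; without that on the sdist side there is nothing to match
against.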

> So: the email that started this thread was a proposal for how to standardize
> the format of "source releases", and Donald's counter was a proposal for how
> to standardize the format of "source wheels". Does that sound right?

Well, essentially yes, although I read it as your original email being
a proposal for a new format to replace sdists, and Donald's and my
counter is that there's already been a certain amount of thinking
and design gone into how we move from the current ad-hoc sdist format
to a better defined and specified "next version", so how does your
proposal affect that?

It seems that your answer is that you want to bypass that and offer an
alternative. Is that fair?

For the record, I don't like the term "source wheel" and would prefer
to stick with "sdist" if appropriate, or choose a term that doesn't
include the word "wheel" otherwise (as wheels seem to me to be
strongly, and beneficially, linked in people's minds to the concept of
a binary release format).

> If so, then some follow-up thoughts:
>
> 1) If we design a source wheel format, then I am 100% in favor of the
> suggestion of giving it a unique extension like "swhl". I'm still a young
> whippersnapper compared to some, but I've been downloading files named
> <package>-<version>.<zip or tar.gz> for 20 years, and AFAICR every one of
> those files unpacked to make a single directory that was laid out like a VCS
> checkout. Obviously we can change what goes inside, but we should change the
> naming convention at the same time because otherwise we're just going to
> confuse people.

I have no problem with making it clear that "sdist version 2" or
"source wheel" is not the same as a packed VCS checkout. I don't see
the need for a new term, I'd be happy with "<package>-<version>.sdist"
as the name. I'd also like to emphasize strongly that PyPI only hosts
sdists, and *not* source releases - that source releases are typically
only seen in Python in the form of a VCS checkout or development
directory.

(There's an implication there that we need to explore, that pip won't
necessarily gain the ability to be pointed at a non-sdist format
packed "source release" archive, and download it and process it.
That's not a given, but I'd like to be sure we are happy with the
potential re-introduction of confusion over the distinction between a
sdist/source wheel and a source release that would result).

> 2) I think there's a strong case to be made that Python actually needs
> standards for *both* source releases and source wheels. There's certainly no
> logical contradiction -- they're conceptually different things. It sounds
> like we all agree that "pip" should continue to have a way to build and
> install an arbitrary VCS checkout, and extending that standard to cover
> building and installing a classic "source release" would be... almost
> difficult *not* to do.

As noted above, I'm happy for that discussion to occur. But I'm *not*
sure the case is as strong as you think. Technically, it's certainly
not too hard, but the social issues are what concern me. How will we
explain to someone that they can't upload their file to PyPI because
it's a "source release" not a "source wheel"? What is the implication
on people's workflow? How would we explain why people might want to
make "source releases" *at all*? In my own experience, I can only see
a need for a VCS URL that people can clone, a packaged
source artifact that I can upload to PyPI for automatic consumption,
and (binary) wheels. That second item is a "source wheel" - not a
"source release".

> And I think that there will continue to be a clear need for source releases
> even in a world where source wheels exist, because of all those places where
> source releases get used that aren't automagic-easy_install/pip-builds. For
> example, most pure Python packages (the ones that already make "none-any"
> wheels) have no need at all for source wheels, but they still need source
> releases to serve as archival snapshots. And more complex packages that need
> build-time configuration (e.g. numpy) will continue to require source
> releases that can be configured to build wheels that have a variety of
> different properties (e.g., different dependency metadata), so they can't
> get by with source wheels alone -- but you can imagine that such projects
> might reasonably *in addition* provide a source wheel that locks down the
> same default configuration that gets used for their uploaded binary wheel
> builds, and is designed for pip to use when trying to resolve dependencies
> on platforms where a regular binary wheel is unavailable.
>
> Pictorially, this world would look like:
>
> VCS checkout -> source release
>      \              \
>       --------------------+--> in-place install
>                           |
>                           +--> wheels -> install
>                           |
>                           +--> source wheels -> wheels -> install

I don't see the need for source releases that you do. That's likely
because I don't deal with the sorts of complex projects you do,
though, so I'm not dismissing the issue. As I say, my objections are
mostly non-technical. I do think you should consider how to document
"what a source release is intended to achieve" in a way that explains
it to people who don't need the complexity it adds - and with the
explicit goal of making sure that you dissuade people who *don't* need
source releases from thinking they do.

> 3) It sounds like we all agree that
>   - 'pip install <VCS checkout>' should work

Yes.

>   - that there is some crucial metadata that VCS checkouts won't be able to
> provide without running arbitrary code (e.g. dependencies and version
> numbers)

I'm still resisting this one, although I can live with "Nathaniel
tells me so" :-)

>   - that what metadata they do provide (e.g., which arbitrary code to run)
> should be specified in a human-friendly configuration file

I don't agree to that one particularly, in the sense that I don't
really care. I'd be happy with a system that said something like: for
a VCS checkout, pip expects "setup.py egg_info" and "setup.py sdist"
to work, and to produce respectively a set of static metadata in a
known location, and a properly formatted "source wheel"/sdist file.
Non-distutils build tools can write a wrapper setup.py that works
however they prefer. (That's roughly what we have now, BTW).
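
For illustration only, a toy wrapper along those lines might look like the
Python below. The project name, the function names and the output layout
(<base>/<name>.egg-info/PKG-INFO) are assumptions made for the sketch; the
exact files and options pip expects are not spelled out here, and a real
wrapper would hand off to its own build tool:

```python
import os

NAME, VERSION = "example", "1.0"  # hypothetical project


def egg_info(base_dir):
    """Write static metadata to <base_dir>/<name>.egg-info/PKG-INFO."""
    target = os.path.join(base_dir, NAME + ".egg-info")
    os.makedirs(target, exist_ok=True)
    path = os.path.join(target, "PKG-INFO")
    with open(path, "w") as f:
        f.write("Metadata-Version: 1.1\nName: %s\nVersion: %s\n" % (NAME, VERSION))
    return path


def main(argv, base_dir="."):
    """Dispatch on the command name the way a wrapper setup.py could."""
    if argv[1:2] == ["egg_info"]:
        return egg_info(base_dir)
    # A real wrapper would also handle "sdist" by calling its own build tool.
    raise SystemExit("unsupported command: %r" % argv[1:])
```

A wrapper setup.py built this way would end by calling main(sys.argv), so
that "python setup.py egg_info" produces the metadata file.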

> Given this, in the big picture it sounds like the only really essentially
> controversial things about the original proposal might be:
>   - that 'pip install <tarball of VCS checkout>' should work the same as
> 'pip install <VCS checkout>' (does anyone actually disagree?)

Yes, to the extent that I want to ensure it's clearly documented how,
and why, this differs from a sdist/source wheel. But that's not a
technical issue.

>   - the 1 paragraph describing the "transitional plan" allowing pip to
> automatically install from these source releases *as part of a dependency
> resolution plan* (as opposed to when a VCS-checkout-or-equivalent is
> explicitly given as an install target). Which honestly I don't like either,
> it's just there as a backcompat measure so that these source releases don't
> create a regression versus existing sdists -- note that one of the goals of
> the design was that current sdists could be upgraded into this format by
> dropping in a single static file (or by an install tool "virtually" dropping
> in this file when it encounters an old-style sdist -- so you don't need to
> keep around code to handle both cases separately).

Given that these source releases won't be hosted on PyPI (as the
proposal currently stands) there's no real need for this - all you
need to say is that you can point pip at any old URL and take your
chances :-)

> Does that sound right?

Not really. For me the big controversy is whether we move forward from
where we are with sdists, or we ignore the current sdist mechanism and
start over.

A key further question, which I don't think has been stated explicitly
until I started this email, is what formats will be supported for
hosting on PyPI. I am against hosting formats that don't support
static metadata, such as your "source release", as I don't see
how PyPI would be able to publish the metadata if it weren't static.

And following on from that, we need to agree whether the key formats
should be required to have a static version. I'm OK with a VCS
checkout having a dynamically generated version, that's part of the
"all bets are off" contract over such things (if you don't generate a
version that reflects every change, you get to deal with the
consequences) but I don't think that's a reasonable thing to allow in
"published" formats.

> (Other features of the original proposal include stuff like the lack of
> trivial metadata like "name" and "description", and the support for
> generating multiple wheels from one directory. I am explicitly calling these
> "inessential".)

Not sure what you mean by lack of name/description being
"inessential". The double negative confuses me. Do you mean you're OK
with requiring them? Fair enough.

For multiple wheels, I'd tend to consider the opposite to be true -
it's not that the capability is non-essential, but rather that in
published formats (source wheel and later in the chain) it's essential
that one source generates one target.

Paul

