[Distutils] MANIFEST destiny :)

Phillip J. Eby pje at telecommunity.com
Wed Nov 16 04:26:48 CET 2005

I originally added CVS and Subversion support to setuptools in order to get 
past the pain of the distutils' MANIFEST system.  I used to use 
MANIFEST.in, but it was a royal pain to get right, and I pretty much always 
forgot to add stuff to it.  The most common problem when I shipped a source 
distribution was that the MANIFEST was screwed up, such that a CVS checkout 
worked fine but a source distribution would break.  Ugh.

So, the CVS/Subversion support for setuptools automatically makes your 
MANIFEST include anything under revision control, whether you have a 
MANIFEST.in or not.  If you don't have a MANIFEST.in, the MANIFEST is built 
every time you run an sdist, so it's always up-to-date.  Ah, bliss!

But all is not happy in MANIFESTville.  It turns out that the bdist_rpm 
command expects to build an sdist, for reasons impenetrable to me.  If you 
build an RPM from a source checkout, everything is fine because setuptools 
can auto-discover your files and build the MANIFEST for the new sdist.  But 
if you build an RPM *from* an sdist, it's a no-go: the unpacked sdist has 
no revision control metadata left for setuptools to discover files from.

In addition, many folks have been asking for this autodetection to cover 
package data files as well.  Why, they reasonably ask, must I specify each 
and every file to be included in a package, when the system already knows 
which files I have in revision control, or which are covered by my MANIFEST.in?

The reason I've been avoiding adding this feature, however, is the first 
issue: when you make an sdist, you lose the revision control metadata, so 
it would become impossible to build *any* binary from an sdist, not just 
RPMs.  Until recently, that issue seemed insurmountable.

So today, after looking over the issue a bit, I think I have a plan for 
dealing with MANIFEST:

* Change the MANIFEST format to be platform-independent (currently it 
contains OS-specific path separators)

* Always, always, always build MANIFEST, and always include both the 
MANIFEST file and MANIFEST.in (if present) in the source distribution.

* Disable all the options that allow user control over MANIFEST generation, 
including pruning, defaults, changing the filenames, etc.

* Use the MANIFEST data (along with revision control info) not only for 
producing source distributions, but also to determine what files should be 
considered "package data", if the user passes an 
'include_package_data=True' keyword to setup().
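
As a rough sketch of the first bullet (not actual setuptools code), a 
platform-independent MANIFEST could store every path with '/' separators and 
convert to the local separator only when the file is read back.  The function 
names and the `sep` parameter here are made up for illustration:

```python
import os


def to_manifest_line(local_path, sep=os.sep):
    """Turn a local relative path into a '/'-separated MANIFEST line."""
    # Empty parts (from doubled separators) are dropped.
    return "/".join(part for part in local_path.split(sep) if part)


def from_manifest_line(line, sep=os.sep):
    """Turn a '/'-separated MANIFEST line back into a local path."""
    return sep.join(line.strip().split("/"))
```

A MANIFEST written this way on Windows would contain "docs/index.txt" rather 
than "docs\index.txt", so the same file could be re-read on any platform.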

The net result would be a single source for what constitutes "the 
distribution contents", in the sense of files that are not directly part of 
the distutils build process.  For files that are built automatically in 
some way but should be included in source distributions or as package data, 
you would still have to put them in MANIFEST.in.  But anything that was 
under CVS or Subversion would be handled automatically, and you wouldn't 
have to duplicate data between MANIFEST.in and setup(package_data={...}).
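
To make the "single source" idea concrete, here is a minimal sketch of how 
package data might be derived from the MANIFEST contents instead of being 
listed a second time in setup().  This is an assumption about how it could 
work, not the implementation; the function name and its arguments are 
hypothetical, and paths are assumed to be '/'-separated:

```python
def package_data_from_manifest(manifest_files, package_dirs):
    """Map each package to the non-Python manifest files inside its directory.

    manifest_files: iterable of '/'-separated relative paths.
    package_dirs:   dict of package name -> '/'-separated directory.
    """
    # Try longer directories first, so a file is credited to the
    # innermost package that contains it.
    by_depth = sorted(package_dirs.items(), key=lambda item: -len(item[1]))
    result = {}
    for path in manifest_files:
        if path.endswith(".py"):
            continue  # modules are already handled by the normal build
        for package, directory in by_depth:
            prefix = directory.rstrip("/") + "/"
            if path.startswith(prefix):
                result.setdefault(package, []).append(path[len(prefix):])
                break
    return result
```

Files outside any package directory (a top-level README, say) are simply 
skipped here; they would still go into the sdist, just not into package data.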

I'm also thinking that most of the MANIFEST logic could and should move to 
the Distribution class, since the data will be used by multiple 
commands.  Thus, the sdist command could just ask the Distribution for the 
MANIFEST and get it, as would the commands that copy package data files to 
the build directory.
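
Roughly, the shape I have in mind looks like the sketch below.  The class and 
method names are stand-ins invented for illustration (this is not the real 
Distribution API), and the file scan is passed in as a plain callable:

```python
class ManifestOwner:
    """Hypothetical stand-in for the Distribution class."""

    def __init__(self, find_files):
        self._find_files = find_files  # callable returning an iterable of paths
        self._manifest = None

    def get_manifest(self):
        # Build the list once per run, then hand the same data to every command.
        if self._manifest is None:
            self._manifest = sorted(set(self._find_files()))
        return self._manifest


class SdistCommand:
    """Stand-in for sdist: archives everything in the manifest."""

    def __init__(self, distribution):
        self.distribution = distribution

    def files_to_archive(self):
        return list(self.distribution.get_manifest())


class PackageDataCommand:
    """Stand-in for a command that copies data files to the build directory."""

    def __init__(self, distribution):
        self.distribution = distribution

    def data_files(self):
        manifest = self.distribution.get_manifest()
        return [name for name in manifest if not name.endswith(".py")]
```

The point of the design is that neither command scans the tree itself, so 
they can't disagree about what the distribution contains.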

I suspect the most controversial parts of this idea are:

* Disabling all user control of MANIFEST
* Forcibly including MANIFEST and MANIFEST.in in source distributions
* Making MANIFEST be always platform-independent

When Googling the issues around MANIFEST, I noticed that the idea of having 
MANIFEST or MANIFEST.in included automatically has been repeatedly shot 
down here over the years.  However, if I followed the logic put forth on 
those occasions, I would never have implemented revision control support in 
the first place, so I guess if I'm in for a penny, I might as well be in 
for a pound, as they say.

I couldn't find any argument one way or the other about the 
manifest-generation options, nor any reasons why MANIFEST needs to remain 
platform-specific, so I presume the options are just YAGNI and the format 
was just an implementation accident.

Likewise, as far as I can tell there is no reason for *not* regenerating a 
MANIFEST whenever you need one, so the current behavior of only building 
one when MANIFEST.in changes or you use --force-manifest seems like a 
premature optimization.  Or maybe it wasn't an excessive optimization when 
the distutils were created, but it's not as if it's going to save you much 
time compared to the actual archive building process today.

I'm thinking that basically --force-manifest would become a no-op in 
setuptools, in the sense that you won't be able to *stop* the MANIFEST from 
being built every single time.  --manifest-only would still be 
possible.  --manifest and --template would have to be rejected, however, 
because the standard name is needed for MANIFEST to be re-read when you 
build stuff from the produced sdist.

--no-defaults would be ignored, except for a warning.  If you don't want 
the defaults, you can always start your MANIFEST.in with an exclude pattern 
to exclude absolutely everything already included.  There shouldn't be two 
ways to do the same thing, especially not one that you can use on the 
command line to mess things up in a non-repeatable fashion!  Likewise 
--no-prune, because that's a similar recipe for disaster.
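
The option policy described above could be sketched as follows.  This is only 
an illustration under assumed names: the options are modeled as a plain dict, 
and the key names and the function are hypothetical rather than the real 
sdist command attributes:

```python
import warnings

REJECTED = ("manifest", "template")    # would break re-reading MANIFEST
IGNORED = ("no_defaults", "no_prune")  # accepted, but only produce a warning


def normalize_sdist_options(options):
    """Apply the proposed policy to a dict of parsed sdist options."""
    for name in REJECTED:
        if options.get(name):
            raise ValueError(
                "--%s is not supported; the standard file names are required"
                % name.replace("_", "-")
            )
    cleaned = dict(options)
    for name in IGNORED:
        if cleaned.pop(name, False):
            warnings.warn("--%s is ignored" % name.replace("_", "-"))
    # The MANIFEST is rebuilt on every run, so --force-manifest is a no-op.
    cleaned["force_manifest"] = True
    return cleaned
```

Options not mentioned in the policy, such as --manifest-only, pass through 
unchanged.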

A lot of these ideas are potential backward compatibility problems, so 
we'll have to see how they play out in setuptools before considering them 
for addition to the distutils.  My guess, however, is that most prolific 
Python developers want to spend their time writing code, not writing and 
debugging MANIFEST.in files, and that fact has been responsible for a lot 
of setuptools uptake so far.  I've been seeing a lot of projects that use 
setuptools for no apparent reason other than it makes writing the setup 
script a little easier, due to find_packages(), package_data, and the lack 
of need for a MANIFEST when source control is involved.  These are 
qualities I'd like to extend further, even at the cost of some flexibility.

Heck, most of the distutils' flaws lie in their extreme versatility.  You 
can tell each individual command that it's using different build or 
distribution directories, for example, and in the process completely foul 
up your builds.  What's more, every distutils tutorial may well end up 
giving people different instructions as to the "best" way to lay out a 
project directory.  If there's ever a "distutils 2", it needs to become 
dictator-ware and tell you exactly what the One Obvious Way is.  If 
everything *had* to be a particular way, then changing how the distutils 
work would actually be possible, whereas now, it's bloody hard to even 
figure out which of the nine billion ways to do it are actually in use.

Okay, off the soapbox now.  :)  Does anybody see any issues with this that 
I'm missing, with respect to using the MANIFEST/FileList machinery to 
control sdist and package data, or my implementation plans for doing 
so?  Thanks.
