[Distutils] Proposed new manifest scheme

Greg Ward gward@cnri.reston.va.us
Sun, 6 Feb 2000 10:55:00 -0500


Hi all --

right up at the top of my list of things to do for the Distutils is "fix
the dist command".  The main problems are that the syntax for the
MANIFEST file (where you specify the files to go in a source
distribution) is a bit goofy, and there's poor feedback about which
files will actually be included.

Part 1 of the fix is to rearrange how the "dist" command works.  I've
worked out a solution that appeals to me; give this a read and see if it
sounds right to you.

Step 1 (optional):
  Module developer creates MANIFEST.in file.  This is a concise-but-
  readable specification of the files to be included in the source
  distribution with wildcards, directory recursion, and exclusion.
  (Syntax to be worked out in a future post; it should have the same
  features as the current syntax, but be more readable.)

Step 2 (optional):
  Developer runs "dist" command with "--manifest-only" option.
  Distutils parses MANIFEST.in and spits out MANIFEST, containing a
  complete, explicit list of every file to be included in the source
  distribution.

  If MANIFEST.in doesn't exist -- i.e. the developer skipped step 1 --
  Distutils creates a MANIFEST using the "default fileset", mostly the
  pure Python modules and extension module source files mentioned in
  setup.py.

  Possibly: print a warning for every file in the source tree that
  *won't* be included in the source distribution; or write this
  information to another file (with a "files were excluded: see ..."
  warning).

Step 3:
  Developer runs "dist" command.  If MANIFEST doesn't exist or is
  out-of-date (i.e. developer skipped step 2, or edited MANIFEST.in
  since running step 2), it is regenerated.  Distutils then creates a
  source distribution (tarball, zipfile, whatever) containing exactly
  the files listed in MANIFEST.

  Possibly *this* is the step where the warning about excluded files
  should be generated.
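
To make the "doesn't exist or is out-of-date" test in step 3 concrete,
here is a rough sketch of one way it could work, assuming a plain
modification-time comparison between the two files.  The function names
and the one-filename-per-line MANIFEST layout are assumptions for
illustration only, not existing Distutils code.

```python
import os


def manifest_needs_rebuild(manifest="MANIFEST", template="MANIFEST.in"):
    """Return True if MANIFEST should be regenerated (hypothetical helper)."""
    if not os.path.exists(manifest):
        # Developer skipped step 2: nothing to reuse.
        return True
    if not os.path.exists(template):
        # No MANIFEST.in to compare against; keep the existing MANIFEST.
        return False
    # Stale if MANIFEST.in was edited after MANIFEST was written.
    return os.path.getmtime(template) > os.path.getmtime(manifest)


def read_manifest(manifest="MANIFEST"):
    """Return the explicit file list, assuming one filename per line."""
    with open(manifest) as f:
        return [line.strip() for line in f if line.strip()]
```

Note that a timestamp comparison like this only notices edits to
MANIFEST.in; it says nothing about files added to or removed from the
source tree.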

Open issues:

  * Obviously we should regenerate MANIFEST whenever MANIFEST.in is
    updated.  (I.e. MANIFEST is only auto-generated, never edited --
    unless you like playing with fire.)  But what about when the
    filesystem changes?  If I add or remove files or directories,
    MANIFEST should be regenerated.  I seem to recall that detecting
    this in a portable way is pretty much impossible, especially in
    the presence of network filesystems (NFS, SMB).

    Two possible solutions: regenerate MANIFEST every time the "dist"
    command is run, or add a "--force-manifest" option.  I prefer the
    latter because of the (presumed) expense of walking the directory
    tree.

  * Should MANIFEST be included with the source distribution?  What
    about MANIFEST.in?  The only reason I can see for including MANIFEST
    is that it would enable trivial integrity checking; it buys you
    zero security-wise, but at least ensures the download wasn't
    truncated.  I think this is a bogus argument; a truncated ZIP or
    .tar.gz file simply won't unpack without errors, so I don't see
    a big need for checking for the presence of all files.  A less
    trivial integrity check could be done by adding MD5 signatures (or
    something) to MANIFEST, but that would all-but-require regenerating
    the damn thing every time the "dist" command is run.

    Including MANIFEST.in is more defensible; it's one of the "source"
    files the developer uses to maintain the distribution, and third
    parties should be able to regenerate the author's source
    distribution if needed.  (Just following the letter of the free
    software licences: forking should be an option, but hopefully not a
    commonly used one!)

  * When should the developer be warned about files in his development
    tree that weren't picked up by the MANIFEST-generating scan?
    If we do it when the MANIFEST is generated (step 2 above), that's
    when the developer is most likely to be watching -- why would
    you bother doing an explicit separate MANIFEST generation if you're
    not going to watch it closely?

    However, doing it as late as possible -- i.e. when the MANIFEST is
    read and the source distribution is generated -- maximizes the
    chance of spotting any late additions not caught in MANIFEST.in's
    net.  At the very least, this will remind the developer to re-run
    "dist --force-manifest", and it might well remind him to edit his
    MANIFEST.in to catch the late additions.

  * Speaking of warning about files not included: there should be a way
    to say, "Don't warn me about excluding X".  X could be *~, *.bak,
    *.o, or whatever (depending on your platform, editor, compiler,
    development style, ...).  One possibility: any file explicitly
    excluded by MANIFEST.in would be exempt from "not included"
    warnings.  The cost is that you would have to explicitly exclude
    *~, *.o, etc., even though they normally wouldn't be caught in
    MANIFEST.in's web at all and so wouldn't otherwise need excluding.
    Alternatively, we could add a bit more syntax to MANIFEST.in that
    says "Don't necessarily include or exclude this file, but don't
    warn me if it happens to be excluded".  I don't see any need for
    this, and it could be confusing.  Thoughts?

  * Should the "default fileset" be included even when you have a
    MANIFEST.in?  This seems obvious to me, but others (hi Fred!) have
    disagreed in the past.  Since you can always exclude files from
    the default set (a feature of the present syntax that will be
    included in any future syntax), I see no need to make the great
    utility of a default fileset disappear just because you need to
    distribute files *not* mentioned in setup.py.
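
As a rough illustration of the "files not included" warning discussed
above, the scan might look something like the sketch below.  It walks
the source tree, compares what it finds against the file list from
MANIFEST, and skips anything matching a "don't warn me" pattern such as
*~ or *.o.  The function name and the quiet_patterns parameter are
made up for this example; nothing here is settled syntax or an existing
Distutils interface.

```python
import fnmatch
import os


def find_unlisted_files(root, manifest_files, quiet_patterns=()):
    """Return files under root that are not in manifest_files.

    quiet_patterns is a hypothetical list of shell-style patterns
    (e.g. "*~", "*.o") for files we should not warn about.
    """
    listed = {os.path.normpath(name) for name in manifest_files}
    unlisted = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Path relative to the tree root, as it would appear in MANIFEST.
            rel = os.path.normpath(
                os.path.relpath(os.path.join(dirpath, name), root))
            if rel in listed:
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in quiet_patterns):
                continue
            unlisted.append(rel)
    return sorted(unlisted)
```

Whether this runs when MANIFEST is generated or when the source
distribution is built, the walk itself is the same; only the moment the
warnings are printed differs.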

Coming soon: proposed new syntax for the MANIFEST.in file.

        Greg