[Distutils] Re: Package DB: strawman PEP

11 Jul 2001

      ...
PEP: XXX
Title: A Database of Installed Python Packages
Version: $Revision: 1.1 $
Author: A.M. Kuchling <akuchlin@mems-exchange.org>
Type: Standards Track
Created: 08-Jul-2001
Status: Draft
Post-History:
Introduction
This PEP describes a format for a database of Python packages
    installed on a system.
Requirements
We need a way to figure out what packages, and what versions of
    those packages, are installed on a system.  We want to provide
    features similar to CPAN, APT, or RPM.  Required use cases that
    should be supported are:
* Is package X on a system?
        * What version of package X is installed?
        * Where can the new version of package X be found?
          XXX Does this mean "a home page where the user can go and
          find a download link", or "a place where a program can find
          the newest version?"  Perhaps both...
Both, but I would expect that presenting the user with a URL would be
a rare case (perhaps a commandline option).
...
* What files did package X put on my system?
        * What package did the file x/y/z.py come from?
        * Has anyone modified x/y/z.py locally?
* dependencies!
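A rough sketch of how the file-oriented use cases above could be answered from in-memory data.  The function name, the dictionary layout and the sample paths are all made up for illustration; nothing here is part of the proposed PEP.

```python
def build_file_index(installed):
    """Map each installed path to the package that owns it.

    `installed` maps a package name to the list of paths it installed
    (a hypothetical layout, chosen only for this sketch).
    """
    index = {}
    for package, paths in installed.items():
        for path in paths:
            index[path] = package
    return index


# Hypothetical sample data, for illustration only.
installed = {
    "Numeric": ["site-packages/Numeric/__init__.py"],
    "PIL": ["site-packages/PIL/Image.py"],
}
index = build_file_index(installed)

print("PIL" in installed)                       # is package X on the system?
print(installed["Numeric"])                     # what files did X install?
print(index.get("site-packages/PIL/Image.py"))  # which package owns a file?
```

The reverse index is built once and then answers "what package did this file come from?" without rescanning every package record.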
...
Database Location
The database lives in a bunch of files under
s/bunch of files/file/

Filesystem access is just too slow, IMHO.
...
<prefix>/lib/python<version>/install/.  This location will be
    called INSTALLDB through the remainder of this PEP.
XXX is that a good location?  What effect does platform-dependent code
    vs. platform-independent code have on this?
I don't think it is necessary to tie the DB to a specific version of
Python, or have multiple DBs if multiple versions of Python are
installed.

You can get what (I think) you are after by considering that each
package will either depend on "python" or "python-<version>"; when it
comes to installing or removing a package you either do the operation
over all installed Pythons, or just the version the package depends
on.
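To make that concrete, here is a minimal sketch of picking the Pythons an operation applies to.  The dependency strings follow the "python" / "python-<version>" idea above; the function name and the way installed versions are listed are assumptions for the example.

```python
def target_pythons(dependency, installed_versions):
    """Return the installed Python versions an install/remove should touch.

    "python"           -> every installed version
    "python-<version>" -> only that version, if it is installed
    """
    if dependency == "python":
        return list(installed_versions)
    prefix = "python-"
    if dependency.startswith(prefix):
        wanted = dependency[len(prefix):]
        return [v for v in installed_versions if v == wanted]
    raise ValueError("not a Python dependency: %r" % (dependency,))


# Hypothetical set of installed interpreters.
versions = ["1.5.2", "2.0", "2.1"]
print(target_pythons("python", versions))
print(target_pythons("python-2.1", versions))
```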
...
The structure of the database is deliberately kept simple; each
    file in this directory or its subdirectories (if any) describes a
    single package.
The rationale for scanning subdirectories is that we can move to a
    directory-based indexing scheme if the package directory contains
    too many entries.  That is, instead of $INSTALLDB/Numeric, we
    could switch to $INSTALLDB/N/Nu/Numeric or some similar scheme.
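For illustration, the directory-based indexing scheme described above might be computed as follows.  This is only a sketch: the function name and the `depth` parameter are assumptions, and POSIX-style separators are used so the result matches the $INSTALLDB/N/Nu/Numeric example.

```python
import posixpath


def sharded_path(installdb, name, depth=2):
    """Return the record path for `name` under a prefix-sharded layout.

    With depth=2, "Numeric" becomes <installdb>/N/Nu/Numeric.  Names
    shorter than `depth` simply get fewer prefix directories.
    """
    prefixes = [name[:i] for i in range(1, depth + 1) if len(name) >= i]
    return posixpath.join(installdb, *prefixes, name)


print(sharded_path("$INSTALLDB", "Numeric"))
```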
XXX how much do we care about performance?  Do we really need to
    use an anydbm file or something similar?
Enough not to use a filesystem-based INSTALLDB  :)

The DB should be human readable, and easy to modify without needing
special tools (saved my butt a few times with Debian's system).

You really need two databases...
one for what is available,
one for what is installed.

If you don't do a local available DB, the system is useless without a
'net connection; if you keep an available list and flag what is
installed, you have piles of data to read through every time you want
to inspect or display what is installed on the system.
...
XXX is the actual filename important?  Let's say the installation
    data for PIL is in the file INSTALLDB/Numeric.  Is this OK?  When
    we want to figure out if Numeric is installed, do we want to open
    a single file, or have to scan them all?  Note that for
    human-interface purposes, we'll often have to scan all the
    packages anyway, for a case-insensitive or keyword search.
...best done (speed- and efficiency-wise) with a file-based, rather
than filesystem-based, database(s), right?
...
Database Contents
Each file in $INSTALLDB or its subdirectories describes a single
    package, and has the following contents:
An initial line listing the sections in this file, separated
        by whitespace.  Currently this will always be 'PKG-INFO
        FILES'.  This is for future-proofing; if we add a new section,
        for example to list documentation files, then we'd add a DOCS
        section and list it in the contents.  Sections are always
        separated by blank lines.  XXX too simple?
Too complicated.
...
[PKG-INFO section] An initial set of RFC-822 headers
        containing the package information for a file, as described in
        PEP 241, "Metadata for Python Software Packages".
A blank line indicating the end of the PKG-INFO section.
An entry for each file installed by the package.
        XXX Are .pyc and .pyo files in this list?  What about compiled
        .so files?  AMK thinks "no" and "yes", respectively.
Ya.  You need a list of what is in the package; the generated files
can always be generated again.
...
Each file's entry is a single tab-delimited line that contains the
    following fields:
    XXX should each file entry be all on one line and
    tab-delimited?  More RFC-822 headers?  AMK thinks tab-delimited
    seems sufficient.
        * The file's size
Why?  Just look at the file itself.
...
* XXX do we need to store permissions?  The owner/group?
No, this is a function of the OS and sysadmin's preferences; and it
only affects the run-time, not installation or removal.
...
* An MD5 digest of the file, written in hex.  (XXX All 16
          bytes of the digest seems unnecessary; first 8 bytes only,
          maybe?  Is a zlib.crc32() hash sufficient?)
What is the point of using only part of the MD5 digest?  How are you
supposed to tell if the file has changed (corrupted, trojaned or
tweaked) if you don't have a "fingerprint" of some sort (and if
8-bytes was good enough for that purpose, why does MD5 use 16)?
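As a sketch of the "fingerprint" check being discussed, the full MD5 hex digest of a file can be compared against a recorded value using the standard hashlib module.  The helper names and the chunk size are assumptions for the example, not anything the PEP specifies.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the full 16-byte MD5 digest of a file as 32 hex characters."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path, recorded_hex):
    """True if the file's current digest differs from the recorded one."""
    return md5_hex(path) != recorded_hex.lower()
```

A tool answering "has anyone modified x/y/z.py locally?" would call is_modified() with the digest stored at install time.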
...
* The file's full path, as installed on the system.  (XXX
          should it be relative to sys.prefix, or sys.prefix +
          '/lib/python<version>?'  If so, full paths are still needed;
          consider a package that installs a startup script such as
          /etc/init.d/zope)
I think this is going down the wrong track, and it will take someone
familiar with Debian to construct a proper init script for me...
[more later]
...
* XXX some sort of type indicator, to indicate whether this is
          a Python module, binary module, documentation file, config
          file?  Do we need this?
Yes, for config files and docs at least.
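To see how the strawman record format hangs together, here is a minimal parsing sketch.  It assumes a section-list first line, RFC-822 headers, a blank line, then tab-delimited file entries with exactly three fields (size, MD5 hex digest, path); the field order, the function name and the sample record are assumptions, since the draft leaves those details open.

```python
from email.parser import Parser


def parse_record(text):
    """Split one package record into (sections, pkg_info, files)."""
    lines = text.splitlines()
    sections = lines[0].split()              # e.g. ['PKG-INFO', 'FILES']
    body = "\n".join(lines[1:]).lstrip("\n")
    header_text, _, files_text = body.partition("\n\n")
    # PKG-INFO is a block of RFC-822 style headers.
    info = Parser().parsestr(header_text + "\n", headersonly=True)
    files = []
    for line in files_text.splitlines():
        if not line.strip():
            continue
        size, digest, path = line.split("\t")   # assumed field order
        files.append({"size": int(size), "md5": digest, "path": path})
    return sections, dict(info.items()), files


# Hypothetical record, for illustration only.
sample = (
    "PKG-INFO FILES\n"
    "Name: Numeric\n"
    "Version: 20.1\n"
    "\n"
    "1024\td41d8cd98f00b204e9800998ecf8427e\t/usr/lib/python2.1/Numeric.py\n"
)
sections, info, files = parse_record(sample)
print(sections, info["Name"], files[0]["path"])
```

If a type indicator or other field were added to each entry, only the line.split() unpacking would need to change.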
...
A package that uses the Distutils for installation will
    automatically update the database.  Packages that roll their own
    installation
XXX what's the relationship between this database and the RPM or
    DPKG database?  I'm tempted to make the Python database completely
    optional; a distributor can preserve the interface of the package
    management tool and replace it with their own wrapper on top of
    their own package manager.  (XXX but how would the Distutils know
    that, and not bother to update the Python database?)
...[the more] The install path will be dependent on the target
system.  Packages should be in a platform independent format, and
contain enough meta-data that they can be converted to the native
package format (I don't think distutils modules have enough meta-data
to accomplish this, but haven't been keeping up with it or PEP 241).

Don't worry about the installation phase; let distutils convert from a
generic format (if necessary), then hand it off to dpkg, rpm, etc, or
install it itself for those that do not have a native package manager
yet (do a good enough job of this and it may even become the de facto
standard on those platforms).

I have an idea for this aspect... I'm going to rewrite Debian's dh_make
(template-based debianizing tool), moving all the code that extracts
information from the source into the templates themselves (working
title, "izer").  It should be possible to create any format of package
from a distutils module.  See...

	http://packages.debian.org/unstable/devel/dh-make.html

...for more about what this tool does,
and a link to the (perl) source.

The end result being...
The "Python database" will only exist for those platforms that do not
already have a package system, so you would not need to worry about
keeping the two in sync.

- Bruce

Bruce Sass