Package DB: strawman PEP

Daniel Berlin dan at
Mon Jul 9 09:17:35 CEST 2001

Andrew Kuchling <akuchlin at> writes:

>     XXX how much do we care about performance?  Do we really need to
>     use an anydbm file or something similar?
>     XXX is the actual filename important?  Let's say the installation
>     data for PIL is in the file INSTALLDB/Numeric.  Is this OK?  When
>     we want to figure out if Numeric is installed, do we want to open
>     a single file, or have to scan them all?  Note that for
>     human-interface purposes, we'll often have to scan all the
>     packages anyway, for a case-insensitive or keyword search.

Errr, how so?
If you mean a keyword or case-insensitive search on package names and
descriptions, just keep an anydbm file whose sole purpose is to list the
installed packages and their descriptions, and nothing else.
Then you could just search all the keys, rather than having to open and
close 600 files or whatever on every search. (If you cached the results
in memory instead of rescanning every time, you'd have an in-memory
version of this anydbm database, and if you go that far, you might as
well just store it in the database.)

This can easily be rebuilt in case of corruption, since *it* was
generated initially by scanning the files (you just keep it up to date
when you install packages, and if something gets out of whack,
completely rebuild it).
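
To make the idea concrete, here is a rough sketch of such an index, not a
proposal for the real implementation.  It uses the `dbm` module (the modern
name for anydbm).  The function names, the use of the Name and Summary
headers, and the assumption that the headers follow the section-list line
(with an optional blank line in between) are all mine, purely for
illustration.

```python
import dbm
import os
from email.parser import Parser


def read_pkg_info(path):
    """Return the PKG-INFO headers of one package file.

    Assumes the first line lists the sections and the RFC-822
    headers follow it (a blank separator line is tolerated).
    """
    with open(path, encoding="utf-8") as f:
        f.readline()                    # skip the section-list line
        rest = f.read().lstrip("\n")
    return Parser().parsestr(rest, headersonly=True)


def rebuild_index(install_db, index_path):
    """Regenerate the name -> description index by scanning every file.

    index_path should live outside install_db so the index itself
    is not picked up by the scan.
    """
    with dbm.open(index_path, "n") as index:    # "n": always start empty
        for dirpath, _dirs, files in os.walk(install_db):
            for filename in files:
                info = read_pkg_info(os.path.join(dirpath, filename))
                name = info.get("Name", filename)
                summary = info.get("Summary", "")
                index[name.encode("utf-8")] = summary.encode("utf-8")


def search(index_path, keyword):
    """Case-insensitive keyword search over names and descriptions."""
    needle = keyword.lower()
    hits = []
    with dbm.open(index_path, "r") as index:
        for key in index.keys():
            name = key.decode("utf-8")
            summary = index[key].decode("utf-8")
            if needle in name.lower() or needle in summary.lower():
                hits.append(name)
    return sorted(hits)
```

A search then opens one file instead of every package file, and
rebuild_index() is the "completely rebuild it" fallback if the index
ever gets out of whack.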

> Database Contents
>     Each file in $INSTALLDB or its subdirectories describes a single
>     package, and has the following contents:
>         An initial line listing the sections in this file, separated
>         by whitespace.  Currently this will always be 'PKG-INFO
>         FILES'.  This is for future-proofing; if we add a new section,
>         for example to list documentation files, then we'd add a DOCS
>         section and list it in the contents.  Sections are always
>         separated by blank lines.  XXX too simple?

Can we try to keep delimiters the same where possible (i.e., why use
whitespace here, but tabs for file entries)?
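
To show where the two delimiters meet, here is a small sketch of a reader
for the file layout as I understand it from the draft.  The function name
is made up, and I am assuming a blank line follows the section-list line
and that no section contains a blank line of its own.

```python
import re


def parse_package_file(text):
    """Split one package file into its named sections.

    Assumed layout: a first line naming the sections (whitespace
    separated), then one blank-line-separated block per section.
    """
    header, _, body = text.partition("\n")
    names = header.split()                      # whitespace-delimited
    blocks = re.split(r"\n(?:[ \t]*\n)+", body.strip("\n"))
    blocks = [b for b in blocks if b.strip()]
    if len(blocks) != len(names):
        raise ValueError(
            "expected %d sections, found %d" % (len(names), len(blocks)))
    sections = dict(zip(names, blocks))
    if "FILES" in sections:
        # File entries use a different delimiter: tabs.
        sections["FILES"] = [
            line.split("\t") for line in sections["FILES"].splitlines()]
    return sections
```

The reader ends up with one split() on whitespace and one on tabs; a
single delimiter throughout would remove that special case.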

>         [PKG-INFO section] An initial set of RFC-822 headers
>         containing the package information for a file, as described in
>         PEP 241, "Metadata for Python Software Packages".
>         A blank line indicating the end of the PKG-INFO section.
>         An entry for each file installed by the package.  
>         XXX Are .pyc and .pyo files in this list?  What about compiled
>         .so files?  AMK thinks "no" and "yes", respectively.
>     Each file's entry is a single tab-delimited line that contains the
>     following fields: 
>     XXX should each file entry be all on one line and
>     tab-delimited?  More RFC-822 headers?  AMK thinks tab-delimited
>     seems sufficient.
>         * The file's size
>         * XXX do we need to store permissions?  The owner/group?  
Storing the owner/group will get you in trouble if that owner/group
doesn't exist on the other system.
>         * An MD5 digest of the file, written in hex.  (XXX All 16
>           bytes of the digest seems unnecessary; first 8 bytes only,
>           maybe?  Is a zlib.crc32() hash sufficient?)
We should use either MD5 or SHA-1.
MD5 can be done at 10 MB a second (easily) on a Pentium 90, so it's
fast enough.  8 bytes is enough; just make it the first 8.
crc32 will probably not be faster (unless you have *huge* .py files,
you are going to be dominated here by open/close/seek time, not
read/checksum calculation time).
>         * The file's full path, as installed on the system.  (XXX
>           should it be relative to sys.prefix, or sys.prefix +
>           '/lib/python<version>?'  If so, full paths are still needed;
>           consider a package that installs a startup script such as
>           /etc/init.d/zope)
>         * XXX some sort of type indicator, to indicate whether this is
>           a Python module, binary module, documentation file, config
>           file?  Do we need this?
Well, it would allow you to purge docs separately from the rest of the
package, or purge everything but the config files. I've used both
options myself. The other option is to put docs in a separate package.
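
Putting the fields discussed above together, a file entry could look
something like the sketch below.  The field order (size, digest, path,
type), the type names "module", "doc" and "config", and all function names
are my own assumptions; the draft settles none of them.  The digest is the
first 8 bytes of the MD5, written as 16 hex characters.

```python
import hashlib
import os

DIGEST_BYTES = 8    # keep only the first 8 of MD5's 16 bytes


def short_md5(path, chunk_size=65536):
    """Return the first 8 bytes of the file's MD5 digest, in hex."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()[:DIGEST_BYTES * 2]


def make_entry(path, kind):
    """Build one tab-delimited entry: size, digest, full path, type.

    Assumes the path itself contains no tab characters.
    """
    fields = [str(os.path.getsize(path)), short_md5(path),
              os.path.abspath(path), kind]
    return "\t".join(fields)


def parse_entry(line):
    """Turn one entry line back into a dictionary."""
    size, digest, path, kind = line.rstrip("\n").split("\t")
    return {"size": int(size), "digest": digest,
            "path": path, "kind": kind}


def files_to_purge(entries, keep=("config",)):
    """Paths to delete when purging everything except the kept types."""
    return [e["path"] for e in entries if e["kind"] not in keep]
```

With the type field present, files_to_purge(entries) removes everything
but the config files, and files_to_purge(entries, keep=("module",
"config")) removes only the docs.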

>     A package that uses the Distutils for installation will
>     automatically update the database.  Packages that roll their own
>     installation 
>     XXX what's the relationship between this database and the RPM or
>     DPKG database?  I'm tempted to make the Python database completely
>     optional; a distributor can preserve the interface of the package
>     management tool and replace it with their own wrapper on top of
>     their own package manager.  (XXX but how would the Distutils know
>     that, and not bother to update the Python database?)

Well, why don't we email the package owners for a few distributions,
and see what they think?

"My dental hygienist is cute.  Every time I visit, I eat a whole
package of Oreo cookies while waiting in the lobby.  Sometimes
she has to cancel the rest of the afternoon's appointments."
-Steven Wright
