Package DB: strawman PEP
It seems time to bite the bullet and actually begin designing and implementing a database of installed packages. As a strawman to get a focused discussion started, here's a draft of a PEP, with lots of XXX's in it. Followups to the Distutils SIG, please.

--amk

PEP: XXX
Title: A Database of Installed Python Packages
Version: $Revision: 1.1 $
Author: A.M. Kuchling <akuchlin@mems-exchange.org>
Type: Standards Track
Created: 08-Jul-2001
Status: Draft
Post-History:

Introduction

This PEP describes a format for a database of Python packages installed on a system.

Requirements

We need a way to figure out what packages, and what versions of those packages, are installed on a system. We want to provide features similar to CPAN, APT, or RPM. Required use cases that should be supported are:

* Is package X on a system?
* What version of package X is installed?
* Where can the new version of package X be found? XXX Does this mean "a home page where the user can go and find a download link", or "a place where a program can find the newest version?" Perhaps both...
* What files did package X put on my system?
* What package did the file x/y/z.py come from?
* Has anyone modified x/y/z.py locally?

Database Location

The database lives in a bunch of files under <prefix>/lib/python<version>/install/. This location will be called INSTALLDB through the remainder of this PEP.

XXX is that a good location? What effect does platform-dependent code vs. platform-independent code have on this?

The structure of the database is deliberately kept simple; each file in this directory or its subdirectories (if any) describes a single package.

The rationale for scanning subdirectories is that we can move to a directory-based indexing scheme if the package directory contains too many entries. That is, instead of $INSTALLDB/Numeric, we could switch to $INSTALLDB/N/Nu/Numeric or some similar scheme.

XXX how much do we care about performance? Do we really need to use an anydbm file or something similar?

XXX is the actual filename important? Let's say the installation data for PIL is in the file INSTALLDB/Numeric. Is this OK? When we want to figure out if Numeric is installed, do we want to open a single file, or have to scan them all? Note that for human-interface purposes, we'll often have to scan all the packages anyway, for a case-insensitive or keyword search.

Database Contents

Each file in $INSTALLDB or its subdirectories describes a single package, and has the following contents:

An initial line listing the sections in this file, separated by whitespace. Currently this will always be 'PKG-INFO FILES'. This is for future-proofing; if we add a new section, for example to list documentation files, then we'd add a DOCS section and list it in the contents. Sections are always separated by blank lines. XXX too simple?

[PKG-INFO section] An initial set of RFC-822 headers containing the package information for a file, as described in PEP 241, "Metadata for Python Software Packages".

A blank line indicating the end of the PKG-INFO section.

An entry for each file installed by the package. XXX Are .pyc and .pyo files in this list? What about compiled .so files? AMK thinks "no" and "yes", respectively.

Each file's entry is a single tab-delimited line that contains the following fields:

XXX should each file entry be all on one line and tab-delimited? More RFC-822 headers? AMK thinks tab-delimited seems sufficient.

* The file's size
* XXX do we need to store permissions? The owner/group?
* An MD5 digest of the file, written in hex. (XXX All 16 bytes of the digest seems unnecessary; first 8 bytes only, maybe? Is a zlib.crc32() hash sufficient?)
* The file's full path, as installed on the system. (XXX should it be relative to sys.prefix, or sys.prefix + '/lib/python<version>?' If so, full paths are still needed; consider a package that installs a startup script such as /etc/init.d/zope)
* XXX some sort of type indicator, to indicate whether this is a Python module, binary module, documentation file, config file? Do we need this?

A package that uses the Distutils for installation will automatically update the database. Packages that roll their own installation

XXX what's the relationship between this database and the RPM or DPKG database? I'm tempted to make the Python database completely optional; a distributor can preserve the interface of the package management tool and replace it with their own wrapper on top of their own package manager. (XXX but how would the Distutils know that, and not bother to update the Python database?)

Deliverables

Patches to the Distutils that 1) implement an InstallationDatabase class, 2) update the database when a new package is installed. 3) a simple package management tool, features to be added to this PEP. (Or a separate PEP?)

References

[1] Michael Muller's patch (posted to the Distutils-SIG around 28 Dec 1999) generates a list of installed files.

Acknowledgements

Ideas for this PEP originally came from postings by Greg Ward, Fred Drake, Mats Wichmann, and others.

Copyright

This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
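
To make the strawman layout concrete, here is a minimal sketch of a reader for one INSTALLDB file. It is not part of the draft. It assumes the open XXX questions are settled the simplest way: each FILES line carries exactly three tab-separated fields in the order size, digest, path, and no permission or type field is stored. The function name `parse_package_file` is hypothetical.

```python
from email.parser import Parser


def parse_package_file(text):
    """Parse one INSTALLDB file laid out as in the strawman draft.

    Assumed layout: a first line naming the sections, then the sections
    themselves separated by blank lines, in the order they were named.
    """
    first_line, _, rest = text.partition("\n")
    section_names = first_line.split()

    # Sections are separated by blank lines; empty chunks are dropped so an
    # optional blank line after the first line is tolerated.
    chunks = [c.strip("\n") for c in rest.split("\n\n") if c.strip()]
    sections = dict(zip(section_names, chunks))

    # PKG-INFO is a block of RFC-822 headers (PEP 241).
    headers = Parser().parsestr(sections.get("PKG-INFO", ""), headersonly=True)

    files = []
    for line in sections.get("FILES", "").splitlines():
        if not line.strip():
            continue
        # Assumed field order: size, digest, path (tab-delimited).
        size, digest, path = line.split("\t")
        files.append({"size": int(size), "digest": digest, "path": path})

    return dict(headers.items()), files
```

Because the first line names the sections, a later DOCS section would show up in `sections` without changing how the existing two are read, which is the future-proofing the draft asks for.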
Andrew Kuchling <akuchlin@mems-exchange.org> writes:
XXX how much do we care about performance? Do we really need to use an anydbm file or something similar?
XXX is the actual filename important? Let's say the installation data for PIL is in the file INSTALLDB/Numeric. Is this OK? When we want to figure out if Numeric is installed, do we want to open a single file, or have to scan them all? Note that for human-interface purposes, we'll often have to scan all the packages anyway, for a case-insensitive or keyword search.
Errr, how so? If you mean a keyword/case-insensitive search on package description/name, just keep an anydbm whose sole purpose is to list the installed packages and descriptions, and nothing else. Then you could just search all the keys, rather than have to open and close 600 files or whatever every time. (If you cached the scan in memory instead of redoing it every time, you'd have an in-memory version of this anydbm database, and if you go that far, you might as well just store it in the database.) This can easily be rebuilt in case of corruption, since *it* was generated initially by scanning the files (you just keep it up to date when you install packages, and if something gets out of whack, completely rebuild it).
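
A rough sketch of the index suggested above, for illustration only. It uses the `dbm` module (the Python 3 successor to `anydbm`) and assumes each package file is named after its package and carries a PEP 241 `Summary:` header. The function names `rebuild_index` and `search_index` are made up for this example, and the index file is assumed to live outside the scanned directory.

```python
import dbm
import os


def rebuild_index(installdb_dir, index_path):
    """Regenerate the name -> summary index by scanning every package file."""
    # Flag "n" always starts from an empty database, so a corrupt index
    # is simply thrown away and rebuilt from the per-package files.
    with dbm.open(index_path, "n") as index:
        for dirpath, _dirs, filenames in os.walk(installdb_dir):
            for filename in filenames:
                summary = ""
                with open(os.path.join(dirpath, filename), encoding="utf-8") as fh:
                    for line in fh:
                        if line.startswith("Summary:"):
                            summary = line.split(":", 1)[1].strip()
                            break
                index[filename] = summary


def search_index(index_path, keyword):
    """Case-insensitive keyword search over package names and summaries."""
    keyword = keyword.lower()
    with dbm.open(index_path, "r") as index:
        return sorted(
            key.decode()
            for key in index.keys()
            if keyword in key.decode().lower()
            or keyword in index[key].decode().lower()
        )
```

The per-package files stay the source of truth; the index is disposable, which is what makes the "completely rebuild it" recovery path cheap to reason about.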
Database Contents
Each file in $INSTALLDB or its subdirectories describes a single package, and has the following contents:
An initial line listing the sections in this file, separated by whitespace. Currently this will always be 'PKG-INFO FILES'. This is for future-proofing; if we add a new section, for example to list documentation files, then we'd add a DOCS section and list it in the contents. Sections are always separated by blank lines. XXX too simple?
Can we try to keep delimiters the same where possible (i.e., why use whitespace here, but tabs for file entries)?
[PKG-INFO section] An initial set of RFC-822 headers containing the package information for a file, as described in PEP 241, "Metadata for Python Software Packages".
A blank line indicating the end of the PKG-INFO section.
An entry for each file installed by the package. XXX Are .pyc and .pyo files in this list? What about compiled .so files? AMK thinks "no" and "yes", respectively.
Each file's entry is a single tab-delimited line that contains the following fields: XXX should each file entry be all on one line and tab-delimited? More RFC-822 headers? AMK thinks tab-delimited seems sufficient.
* The file's size
* XXX do we need to store permissions? The owner/group?
owner/group will get you in trouble if the owner/group doesn't exist on the other system.
* An MD5 digest of the file, written in hex. (XXX All 16 bytes of the digest seems unnecessary; first 8 bytes only, maybe? Is a zlib.crc32() hash sufficient?)
We should use either MD5 or SHA-1. MD5 can easily be done at 10 MB a second on a Pentium 90, so it's fast enough. 8 bytes is fine; just make it the first 8. crc32 will probably not be faster: unless you have *huge* .py files, you are going to be dominated here by open/close/seek time, not read/checksum calculation time.
* The file's full path, as installed on the system. (XXX should it be relative to sys.prefix, or sys.prefix + '/lib/python<version>?' If so, full paths are still needed; consider a package that installs a startup script such as /etc/init.d/zope)
* XXX some sort of type indicator, to indicate whether this is a Python module, binary module, documentation file, config file? Do we need this?
Well, it would allow you to purge docs separately from the rest of the package, or purge everything but the config files. I've used both options myself. The other option is to put docs in a separate package.
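
On the digest question raised earlier in this reply (MD5, truncated to the first 8 bytes), a minimal sketch of how a tool might compute and check such a fingerprint. The helper names `file_digest` and `is_modified` are hypothetical, and storing the size alongside the digest follows the draft's field list rather than anything decided in the thread.

```python
import hashlib
import os


def file_digest(path, digest_bytes=8, chunk_size=65536):
    """Return the first `digest_bytes` bytes of the file's MD5, in hex."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    # Two hex characters per byte.
    return md5.hexdigest()[: digest_bytes * 2]


def is_modified(path, recorded_size, recorded_digest):
    """Compare a file on disk with the size and digest recorded at install."""
    if os.path.getsize(path) != recorded_size:
        return True  # cheap check first; no need to read the file
    return file_digest(path, len(recorded_digest) // 2) != recorded_digest
```

Passing `digest_bytes=16` keeps the whole MD5, so the same code covers either outcome of the truncation debate.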
A package that uses the Distutils for installation will automatically update the database. Packages that roll their own installation
XXX what's the relationship between this database and the RPM or DPKG database? I'm tempted to make the Python database completely optional; a distributor can preserve the interface of the package management tool and replace it with their own wrapper on top of their own package manager. (XXX but how would the Distutils know that, and not bother to update the Python database?)
Well, why don't we email the package owners for a few distributions, and see what they think? :)
--
"My dental hygienist is cute. Every time I visit, I eat a whole package of Oreo cookies while waiting in the lobby. Sometimes she has to cancel the rest of the afternoon's appointments." -Steven Wright
[Picking up the package-DB thread from last week] On Mon, Jul 09, 2001 at 03:17:35AM -0400, Daniel Berlin wrote:
This can easily be rebuilt in case of corruption, since *it* was generated initially by scanning the files (you just keep it up to date when you install packages, and if something gets out of whack, completely rebuild it).
The real package database implementation will very likely build a cached version of the data using anydbm or some faster mechanism. I don't think the PEP needs to specify that, though if people disagree it could certainly be added. I
An initial line listing the sections in this file, separated by whitespace. Currently this will always be 'PKG-INFO FILES'. This is for future-proofing; if we add a new section, for example to list documentation files, then we'd add a DOCS section and list it in the contents. Sections are always separated by blank lines. XXX too simple?
...
Each file's entry is a single tab-delimited line that contains the following fields:
Can we try to keep delimiters the same where possible (i.e., why use whitespace here, but tabs for file entries)?
Fair enough; I was thinking the information for a single file is small enough, and easy enough to get right initially, so it shouldn't need extending later. Perhaps RFC822-style should be reused again to list files:

    Name: /usr/local/lib/python2.1/site-packages/mymodule.py
    Mode: 0644
    Size: 565
    Digest: 01abcdef...

    Name: /usr/local/lib/python2.1/site-packages/mymodule2.py
    Mode: 0644
    Size: 1024
    Digest: 02fedbca...

There should be blank lines to mark the end of each file's stanza, but that breaks the idea of using blank lines to indicate the end of a section. Perhaps the 'Files' section should simply always come last, and therefore safely break that convention.
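
For comparison with the tab-delimited form, a minimal sketch of reading the RFC822-style stanzas proposed above. It assumes the Files section comes last, that stanzas are separated by single blank lines, and that `Mode` is written in octal. The function name `parse_file_stanzas` is invented for the example.

```python
from email.parser import Parser


def parse_file_stanzas(section_text):
    """Split a Files section into one dict per blank-line-separated stanza."""
    stanzas = []
    for block in section_text.strip().split("\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        headers = Parser().parsestr(block, headersonly=True)
        stanzas.append(
            {
                "name": headers["Name"],
                "mode": int(headers["Mode"], 8),  # "0644" is read as octal
                "size": int(headers["Size"]),
                "digest": headers["Digest"],
            }
        )
    return stanzas
```

A new per-file field (a type indicator, say) would only add another header line here, which is the extensibility argument for this layout over fixed tab-separated columns.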
owner/group will get you in trouble if the owner/group doesn't exist on the other system.
Good point.
* XXX some sort of type indicator, to indicate whether this is a Python module, binary module, documentation file, config file? Do we need this?
Well, it would allow you to purge docs separately from the rest of the package, or purge everything but the config files. I've used both options myself. The other option is to put docs in a separate package.
Docs in a separate package is unlikely, so I suppose we need to use a type indicator. --amk
On Tue, 17 Jul 2001, Andrew Kuchling wrote:
[Picking up the package-DB thread from last week]
On Mon, Jul 09, 2001 at 03:17:35AM -0400, Daniel Berlin wrote:
This can easily be rebuilt in case of corruption, since *it* was generated initially by scanning the files (you just keep it up to date when you install packages, and if something gets out of whack, completely rebuild it).
The real package database implementation will very likely build a cached version of the data using anydbm or some faster mechanism. I don't think the PEP needs to specify that, though if people disagree it could certainly be added. I
This is starting to bother me even more than it did at first. If Distutils has a package database, and you use it to generate a native binary package, will the generated package update the distutils database, the native package manager's database, both, or neither? There's a whole boatload of problems no matter which answer you pick.

The discussions so far have indicated to me that the primary problem is that the Windows platform has no true "native" manager. Does it make sense to provide this as part of Distutils for the sake of one platform, or does it make more sense to standardize on an installer for Windows that provides package management? Distutils bdist_win* packages would then be handled through a consistent interface that would register their contents properly. It could even go further, and properly manage multiple Python installations on the same machine, with different Distutils packages in each tree.

Maybe I'm out in left field, here, but what exactly is the problem that this PEP is trying to solve? Every package manager I'm aware of covers all the questions in the "Requirements" statement, with the possible exception of "Where can I download a new version?" I always manage to find a place to stick that info (VENDOR, HOTLINE, or some such), so it _can_ be grafted into the PEP 241 metadata usage for binary packaging.

This is exactly why I wrote bdist_packager (patch #415226): so that module authors could provide metadata in a single unified manner which Distutils bdist commands can slice and dice appropriately for each supported package manager.

Think about this, please. If you are going to continue a python repository of this scale, are you going to discontinue support for native binaries? It just doesn't make sense to do both.

mwa
It seems time to bite the bullet and actually begin designing and implementing a database of installed packages. As a strawman to get a focused discussion started, here's a draft of a PEP, with lots of XXX's in it. Followups to the Distutils SIG, please.
it's definitely off to a decent start, i'll throw out some "things to consider"...

if a user changes the install prefix and installs the package somewhere "off the path", does the package manager keep track of it? i'm guessing it should only handle packages inside the python tree?

it will be hard to add an automatic "download link" for updating packages. what if a package has gone through multiple versions since it was installed? i think this falls into the realm of the nebulous online package library. the good news about going with a central online repository is that you can automatically get new packages, not just update ones you already have.

how to handle the python standard library? is there just a single entry for the entire library? does it not matter since the user can just query the python version and get a good idea of what is available? i'm kind of thinking of the "optional" standard runtime libraries. then again, it's easy enough to see if they exist by catching any Import exceptions, and i don't think this sort of thing has been a problem to anyone in the past. no need to kill ourselves handling something that hasn't bothered anyone before. :]

i do like the database of installed files. i see it can already be created by distutils with some options, it should mean 'uninstall' tools are pretty much automatic.

we'll need some information in the distutils setup.py to define what the dependencies for a package really are. how should this be laid out? package names with a minimum version?

what about binary packages? most windows users don't have a compiler, and will need prebuilt extension modules for their system. again, i believe this falls more into the role of having an online package repository, but it would be worth mentioning in this pep?

for the love of all things good, can we please make a recommendation in our PEP that the windows installation location be something other than "C:\PYTHON21"? something like "C:\PYTHON21\SITE-PACKAGES" would be a big improvement. i thought i heard that macpython recently made this "fix", why is the windows version lagging on this?

as for some of the questions you posed...

i was thinking the name would be important. for each installed package you have a single file with the name of the package (and some appropriate extension maybe). the file would just contain the PKG-INFO with the list of installed files at the end? (is that enough information?). then it becomes simple to see if and what version of a package is installed. perhaps the name should be case-INsensitive, so there is no confusion between "Numeric" and "numeric".

things that would take longer with that setup are finding which package a file belongs to, which packages are dependent on a given package, and that sort of stuff.

obviously there will need to be some sort of "packages" module in standard python that has functions to perform all these tasks. it should also include a commandline interface so it can be simply accessed by users... (forgive the name, if that sort of thing has been done to death)

    pyckage.py list               #list installed packages
    pyckage.py update pygame      #get latest version of pygame
    pyckage.py update *           #get latest version of all packages
    pyckage.py showdeps Numeric   #you get the idea?

anyways, perhaps the PEP should list what functions are available and how they would be accessed on the commandline (or is that too much detail for a PEP?)

oh, one last thing to consider... at this point, my package has been put into several different binary packages, RPM and DEB for example. we need to figure something out here. perhaps an installed package can be marked "Not Under Python Package Control", so it can't be removed or updated from the python package stuff. perhaps we could be more flexible and allow the package to embed commands to do these things for it? "uninstall_command" could be set to something like "apt-get remove pygame"? on the other hand, if a really nice package management system is in place, we may no longer need to rely on OS tools to do the packaging for us?

hmm, same with windows too. if a user installs a package with a distutils executable bdist, it adds some entries to the registry to allow the user to uninstall it. if this python package management uninstalls the package, should there be some sort of hook so it can clean out the registry keys too?

whew, a lot of things here. perhaps they don't all need to be addressed by the PEP for now, but as long as they are tickling in the back of someone's head as this gets developed i think we'll be in a much better position :]
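
The command list floated in this post maps naturally onto subcommands. The following is only a sketch of the argument parsing, using `argparse` from today's standard library (it did not exist in 2001); the subcommand names come from the post, while `build_parser` and everything behind the subcommands are hypothetical.

```python
import argparse


def build_parser():
    """Build a command-line parser for the proposed pyckage.py commands."""
    parser = argparse.ArgumentParser(prog="pyckage.py")
    commands = parser.add_subparsers(dest="command", required=True)

    commands.add_parser("list", help="list installed packages")

    update = commands.add_parser("update", help="get the latest version")
    # "*" would mean all packages; it must be quoted in most shells.
    update.add_argument("package", help="package name, or * for all packages")

    showdeps = commands.add_parser("showdeps", help="show a package's dependencies")
    showdeps.add_argument("package", help="package name")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # A real tool would dispatch to the package database here.
    print(args)
```

Only the interface is sketched; what `update` should do when a package is owned by RPM or dpkg is exactly the open question the post raises.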
On Sun, 8 Jul 2001, Andrew Kuchling wrote:
It seems time to bite the bullet and actually begin designing and implementing a database of installed packages. As a strawman to get a focused discussion started, here's a draft of a PEP, with lots of XXX's in it. Followups to the Distutils SIG, please.
I'm confused. Why? What does this give us that native package managers don't? How is it going to keep synchronized with the package manager? If I understand correctly, most of what's desired here could be accomplished by including the PEP 241 metadata in the module in a manner that could be inspected upon import. Distutils could even go so far as to create a PKG_INFO subpackage so you could import mymodule.PKG_INFO. If it succeeds, it's a distutils-created module and contains all PEP 241 required information.
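
A small sketch of what the import-time check described above could look like. It assumes Distutils would generate a `PKG_INFO` submodule inside each package, which is this post's proposal and not an existing Distutils feature; `load_pkg_info` is a made-up helper name.

```python
import importlib


def load_pkg_info(package_name):
    """Return the package's PKG_INFO submodule, or None if it has none.

    A None result means the package is missing or was not installed with
    the hypothetical PKG_INFO-generating Distutils.
    """
    try:
        return importlib.import_module(package_name + ".PKG_INFO")
    except ImportError:
        # Note: this also hides import errors raised inside PKG_INFO itself.
        return None
```

This answers "is package X installed, and which version?" without any separate database, but it cannot answer "which package owns this file?" without importing everything.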
XXX what's the relationship between this database and the RPM or DPKG database? I'm tempted to make the Python database completely optional; a distributor can preserve the interface of the package management tool and replace it with their own wrapper on top of their own package manager. (XXX but how would the Distutils know that, and not bother to update the Python database?)
If it's optional, then maybe, sometimes, you might have the information that it could have provided you. In the long run, you're better off making it simple and painless to create packages for the native system's package manager. Then people need look in only one place for everything. (I still like the idea of the 241 metadata being accessible in the module itself. It makes it more convenient for module authors to query information specific to distutils module relationships.)

Mark
I'd like to take the opportunity to discuss some of the work done at ActiveState that is pertinent to the discussion at hand. While I suspect most people subscribed to this list are already familiar with it, I'll assume nothing for clarity's sake.

In its current form a repository is accessible which allows for easy installation of python packages (see http://www.ActiveState.com/PPMPackages/PyPPM/2.1/ ). Our equivalent of PKG-INFO is a PPD file type that stores data in an XML format. A PPD file describes a PyPPM package, which is a built distutils package in an archive format (.tar.gz or .zip depending on platform). While currently packages can only be downloaded, a demo will be presented at the upcoming O'Reilly open source conference of a repository which exposes functionality discussed in the catalog SIG.

In addition to the standard metadata handled by distutils (author, author_email, license, etc.) a PPD can contain the following elements:

INSTALL - an installation string for installing packages distributed in a non-distutils binary distribution format (e.g. RPM, setup.exe etc.)

DEPENDENCIES - handles dependencies (e.g. PIL-Graph's dependence on PIL)

IMPLEMENTATION - an implementation may contain PYTHONCORE, ARCHITECTURE, and OS elements, in addition to a CODEBASE element. The CODEBASE element defines the URL of the actual package, while the others define the relevant implementation characteristics of said package.

When a PyPPM package is installed on a client machine, package instance data is stored in an XML database. In its current form, the database stores configuration information for the PyPPM client (e.g. a repository location is specified), and descriptions of packages installed (essentially all the information contained in a PPD minus implementation data for other platforms). Additionally storing each file installed within a package would allow for provision of the functionality suggested (and sorely needed :)) in Andrew's PEP.

I would argue that adopting an XML format would be beneficial, unless performance is deemed a major issue; however, I suspect people don't want to rehash PPD vs. PKG-INFO type arguments
PEP: XXX
Title: A Database of Installed Python Packages
Version: $Revision: 1.1 $
Author: A.M. Kuchling <akuchlin@mems-exchange.org>
Type: Standards Track
Created: 08-Jul-2001
Status: Draft
Post-History:
Introduction
This PEP describes a format for a database of Python packages installed on a system.
Requirements
We need a way to figure out what packages, and what versions of those packages, are installed on a system. We want to provide features similar to CPAN, APT, or RPM. Required use cases that should be supported are:
* Is package X on a system?
* What version of package X is installed?
* Where can the new version of package X be found? XXX Does this mean "a home page where the user can go and find a download link", or "a place where a program can find the newest version?" Perhaps both...
Both, but I would expect that presenting the user with a URL would be a rare case (perhaps a commandline option).
* What files did package X put on my system?
* What package did the file x/y/z.py come from?
* Has anyone modified x/y/z.py locally?
* dependencies !
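
Two of the use cases in this list ("What files did package X put on my system?" and "What package did the file x/y/z.py come from?") are the same data read in opposite directions. A minimal sketch, assuming the per-package file lists have already been loaded into a dict; the name `build_owner_index` and the sample paths are illustrative only.

```python
def build_owner_index(installed):
    """Invert {package: [paths]} into {path: package} for reverse lookups."""
    owners = {}
    for package, paths in installed.items():
        for path in paths:
            owners[path] = package
    return owners


# Hypothetical sample data standing in for the parsed database.
installed = {
    "Numeric": ["site-packages/Numeric/__init__.py"],
    "PIL": ["site-packages/PIL/Image.py", "site-packages/PIL/ImageDraw.py"],
}
owners = build_owner_index(installed)
print(owners["site-packages/PIL/Image.py"])  # PIL
```

With one file per package, the reverse lookup requires reading every package file once, which is part of why several replies argue for a single indexed database.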
Database Location
The database lives in a bunch of files under
s/bunch of files/file/ filesystem access is just too slow, IMHO
<prefix>/lib/python<version>/install/. This location will be called INSTALLDB through the remainder of this PEP.
XXX is that a good location? What effect does platform-dependent code vs. platform-independent code have on this?
I don't think it is necessary to tie the DB to a specific version of Python, or have multiple DBs if multiple versions of Python are installed. You can get what (I think) you are after by considering that each package will either depend on "python" or "python-<version>"; when it comes to installing or removing a package you either do the operation over all installed Pythons, or just the version the package depends on.
The structure of the database is deliberately kept simple; each file in this directory or its subdirectories (if any) describes a single package.
The rationale for scanning subdirectories is that we can move to a directory-based indexing scheme if the package directory contains too many entries. That is, instead of $INSTALLDB/Numeric, we could switch to $INSTALLDB/N/Nu/Numeric or some similar scheme.
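
The directory-based scheme in the quoted paragraph can be stated in a few lines. This is a sketch of the $INSTALLDB/N/Nu/Numeric variant only; the draft says "or some similar scheme", so the two-level depth and the function name `package_db_path` are assumptions.

```python
import os


def package_db_path(installdb, name, depth=2):
    """Return the sharded location of a package's database file.

    With depth=2, "Numeric" maps to <installdb>/N/Nu/Numeric.
    """
    # One directory level per leading prefix of the package name.
    prefixes = [name[:i] for i in range(1, depth + 1)]
    return os.path.join(installdb, *prefixes, name)
```

Setting `depth=0` gives back the flat layout, so a tool could support both while the database is small.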
XXX how much do we care about performance? Do we really need to use an anydbm file or something similar?
enough not to use a file system based INSTALLDB :) The DB should be human readable, and easy to modify without needing special tools (saved my butt a few times with Debian's system). You really need two databases... one for what is available, one for what is installed. If you don't do a local available DB, the system is useless without a 'net connection; if you keep an available list and flag what is installed, you have piles of data to read through every time you want to inspect or display what is installed on the system.
XXX is the actual filename important? Let's say the installation data for PIL is in the file INSTALLDB/Numeric. Is this OK? When we want to figure out if Numeric is installed, do we want to open a single file, or have to scan them all? Note that for human-interface purposes, we'll often have to scan all the packages anyway, for a case-insensitive or keyword search.
...best done (speed and efficiency wise) with a file based, rather than filesystem based, database(s), right?
Database Contents
Each file in $INSTALLDB or its subdirectories describes a single package, and has the following contents:
An initial line listing the sections in this file, separated by whitespace. Currently this will always be 'PKG-INFO FILES'. This is for future-proofing; if we add a new section, for example to list documentation files, then we'd add a DOCS section and list it in the contents. Sections are always separated by blank lines. XXX too simple?
Too complicated.
[PKG-INFO section] An initial set of RFC-822 headers containing the package information for a file, as described in PEP 241, "Metadata for Python Software Packages".
A blank line indicating the end of the PKG-INFO section.
An entry for each file installed by the package. XXX Are .pyc and .pyo files in this list? What about compiled .so files? AMK thinks "no" and "yes", respectively.
Ya. You need a list of what is in the package, the list of generated files can always be generated again.
Each file's entry is a single tab-delimited line that contains the following fields: XXX should each file entry be all on one line and tab-delimited? More RFC-822 headers? AMK thinks tab-delimited seems sufficient.
* The file's size
Why, just look at the file itself.
* XXX do we need to store permissions? The owner/group?
No, this is a function of the OS and sysadmin's preferences; and it only affects the run-time, not installation or removal.
* An MD5 digest of the file, written in hex. (XXX All 16 bytes of the digest seems unnecessary; first 8 bytes only, maybe? Is a zlib.crc32() hash sufficient?)
What is the point of using only part of the MD5 digest? How are you supposed to tell if the file has changed (corrupted, trojaned or tweaked) if you don't have a "fingerprint" of some sort (and if 8-bytes was good enough for that purpose, why does MD5 use 16)?
* The file's full path, as installed on the system. (XXX should it be relative to sys.prefix, or sys.prefix + '/lib/python<version>?' If so, full paths are still needed; consider a package that installs a startup script such as /etc/init.d/zope)
I think this is going down the wrong track, and it will take someone familiar with Debian to construct a proper init script for me... [more later]
* XXX some sort of type indicator, to indicate whether this is a Python module, binary module, documentation file, config file? Do we need this?
Yes, for config files and docs at least.
A package that uses the Distutils for installation will automatically update the database. Packages that roll their own installation
XXX what's the relationship between this database and the RPM or DPKG database? I'm tempted to make the Python database completely optional; a distributor can preserve the interface of the package management tool and replace it with their own wrapper on top of their own package manager. (XXX but how would the Distutils know that, and not bother to update the Python database?)
...[the more]

The install path will be dependent on the target system. Packages should be in a platform-independent format, and contain enough metadata that they can be converted to the native package format (I don't think distutils modules have enough metadata to accomplish this, but I haven't been keeping up with it or PEP 241).

Don't worry about the installation phase; let distutils convert from a generic format (if necessary), then hand it off to dpkg, rpm, etc., or install it itself for those that do not have a native package manager yet (do a good enough job of this and it may even become the de facto standard on those platforms).

I have an idea for this aspect... I'm going to rewrite Debian's dh_make (a template-based debianizing tool), moving all the code that extracts information from the source into the templates themselves (working title, "izer"). It should be possible to create any format of package from a distutils module. See...
http://packages.debian.org/unstable/devel/dh-make.html
...for more about what this tool does, and a link to the (perl) source.

The end result being... The "Python database" will only exist for those platforms that do not already have a package system, so you would not need to worry about keeping the two in sync.

- Bruce
participants (6): Andrew Kuchling, Bruce Sass, Dan Milgram, Daniel Berlin, Mark W. Alexander, Pete Shinners