A Modest Proposal for "A Database of Installed Packages"
I'll continue my foolhardy effort [1] to build a concrete proposal for "a database of installed packages" by offering up a sketch of a possible straw-man "solution". I realize that this is likely oversimplified to a fault, but I hope it will help us move forward. Apologies if the equivalent of this has been proposed and rejected before.
My proposal is basically to make PKG-INFO functional and usable by:
* Fixing the technical issues with requirements (i.e. dependencies) and naming as specified by PEP 314/345.
* Modifying distutils to install PKG-INFO alongside each module file or package directory as a side-car file of the same name but with a special extension (.pyi or whatever). These files would be the place to include the optional list of installed files as well as the optional md5sums, if desired by the installer. Files in the package will be listed using relative paths, while far-flung files (bin, shared, etc) will get full paths so that there is full allowance for relocating simple (nothing in bin or shared) modules and packages. Although optional, "python setup.py install" will include the installed file list by default.
That's it.
The intent is to provide just enough information to allow the development of tools to use it, for those that are interested, while being minimally invasive to developers that are not interested in such tools.
To determine the current state of your python environment, walk sys.path looking for modules and packages, collecting PKG-INFO when available. No standard centralized database. Some of us will choose to opt in to a particular installation management tool that might maintain a cache (centralized or per-directory) for efficiency, but that would be considered a performance optimization for that particular tool.
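A minimal sketch of the sys.path walk described above, under stated assumptions: the side-car extension is the ".pyi" placeholder from the proposal (not a settled name), only the top level of each path entry is scanned, and the helper name find_sidecars is purely illustrative.

```python
import os

SIDECAR_EXT = ".pyi"  # placeholder extension from the proposal, not a settled name


def find_sidecars(paths):
    """Return {module_or_package_name: sidecar_path} for the given directories."""
    found = {}
    for directory in paths:
        if not os.path.isdir(directory):
            continue  # path entries may be zip files or stale directories
        for entry in sorted(os.listdir(directory)):
            base, ext = os.path.splitext(entry)
            if ext == SIDECAR_EXT:
                # Assumption: the first directory on the path wins,
                # mirroring the order in which imports are resolved.
                found.setdefault(base, os.path.join(directory, entry))
    return found
```

Calling find_sidecars(sys.path) would give a tool the set of side-car files to read; nothing is cached, which matches the "no standard centralized database" intent.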
We can also bootstrap older python installations by creating an online database (that can of course be downloaded by security-conscious individuals for offline querying) that maps (module-file/package-directory name, md5sum) pairs to their respective PKG-INFO contents (no list of installed files), which can be queried by an automated sys.path walker to fill in missing side-car files. Thus, I can opt in to this scheme for python 2.5 by installing a distutils patch and a meta-data side-car bootstrapper that does its best to identify what's on my sys.path. It would be quite tractable to maintain this for the python standard library and perhaps the official installations of a few major OS versions. Such a database could even be used by the community to provide metadata for packages whose developer didn't (again, furthering an opt-in mentality). Of course, even though it worked for CDDB, it would likely be too much to expect this level of coverage through user-submitted entries.
[1] http://mail.python.org/pipermail/distutils-sig/2008-March/009108.html
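As a rough illustration of the bootstrap lookup, the sketch below computes the (name, md5sum) key for a single module file and looks it up in a mapping. The function names and the dict standing in for the online database are assumptions; how a package directory would be hashed is not specified in the proposal, so only plain files are handled here.

```python
import hashlib
import os


def bootstrap_key(path):
    """Return the (file name, md5 hex digest) pair for one module file."""
    with open(path, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    return os.path.basename(path), digest


def lookup_metadata(path, database):
    """Return the PKG-INFO text recorded for this file, or None if unknown.

    `database` is a plain dict standing in for the hypothetical online database.
    """
    return database.get(bootstrap_key(path))
```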
Hello
On Fri, Mar 28, 2008 at 11:02:19AM -0400, Alexander Michael wrote:
I'll continue my foolhardy effort [1] to build a concrete proposal for "a database of installed packages" by offering up a sketch of a possible straw-man "solution". I realize that this is likely oversimplified to a fault, but I hope it will help us move forward. Apologies if the equivalent of this has been proposed and rejected before. My proposal is basically to make PKG-INFO functional and usable by:
* Fixing the technical issues with requirements (i.e. dependencies) and naming as specified by PEP 314/345.
* Modifying distutils to install PKG-INFO alongside each module file or package directory as a side-car file of the same name but with a special extension (.pyi or whatever). These files would be the place to include the optional list of installed files as well as the optional md5sums, if desired by the installer. Files in the package will be listed using relative paths, while far flung files (bin, shared, etc) will get full paths so that there is full allowance for relocating simple (nothing in bin or shared) modules and packages. Although optional, "python setup.py install" will include the installed file list by default.
That's it.
This proposal has been here about a week now, with no comments on it. I take that as positive, as no one has had major objections. :-)
Personally I think it is a good proposal: it does basically what an installation database would have to do while being minimally invasive. The important question, however, is: is this enough for setuptools to work without doing all its path magic? Would this be a workable solution for setuptools?
Now my own thoughts about the technicalities (sorry this got long)...
Distutils does create a ${pkgname}-${version}.egg-info file right now with the PKG-INFO data in it. From earlier discussions it seems the .egg-info extension is not very loved, so a change to .pyi could be done (also, it has little to do with eggs). Secondly, I'm not sure how useful it is for the version number to be encoded in the filename.
It seems the .egg-info file does get installed in the site-packages root currently. This will likely give conflicts when we start to use namespace packages. We can't put the .pyi *in* the package since then we lose support for simple modules, so we have to place it *next* to the package. So if "bar" is a namespace package inside "foo" then we would have:
  site-packages/foo/bar.pyi
  site-packages/foo/bar/__init__.py
This means any package tool will need to recursively scan the site-packages directory to find the files, but that doesn't seem like too much of a penalty? The alternative is to have a separate directory for the installdb files:
  site-packages/foo/bar/__init__.py
  site-packages/install.db/foo/bar.pyi
This could significantly reduce the scanning time since there are far fewer files to walk. I chose a name with a "." for install.db so we're not stealing a possible module or package name. Other than that, the name of the directory can be anything we manage to agree on. :-) Using this approach might create confusion about relative paths mentioned in .pyi files though (is the root the current directory or do we pretend the .pyi was actually next to the package/module?).
Distributions not providing a package/module, or with a distribution name different from the package(s)/module(s) provided, would end up in the top level of the database (in both scenarios), effectively stealing package/module names, but that seems to be the current behaviour of distutils already anyway. Namespace sub-distributions (bar in the example above) whose distribution name differs from their package/module name would steal names from their namespace.
Namespace packages are not fully handled yet; there is still the issue of who owns site-packages/foo/__init__.py. That would logically be defined by site-packages/foo.pyi, but we don't want the user to have to install yet another package for this. So for a namespace package the .pyi could look like this:
  Metadata-Version: 1.0
  Name: foo
  ...
  Owner: setuptools
  Namespace: True
  Directory: foo/
  File: foo/__init__.py
It might be possible that a namespace package doesn't need an owner so that a different tool is allowed to clean it up, but I'm not sure about that.
When "bar" gets uninstalled now it should know if it can clean up its namespace "foo" too (if it is empty). So bar.pyi should have:
  Metadata-Version: 1.0
  Name: bar
  ...
  Owner: setuptools
  Requires-Namespaces: foo
  Directory: foo/bar/
  File: foo/bar/__init__.py
Here "foo" could also have been a dotted name: "foo.sub.package". So the "foo.sub" package would have both the Namespace: and Requires-Namespaces: fields in its .pyi. AFAIK this should cover namespace packages.
So the new headers to turn the PKG-INFO into a .pyi would be:
Owner: The owner of this distribution. This would be any string representing the package tool, e.g.: distutils, setuptools, zc.buildout, rpm, dpkg, etc.
Provides: Copied from PEP262. Don't like this in its original form since it's ambiguous. So this lists the *distributions* provided by this package on top of its native name in the Name: field. Optional (and very rare).
Modules: List of packages/modules provided. If no packages or modules are installed this doesn't need to be present. You could argue that this should be called Packages: or so. Derived from PEP262.
Namespace: The value of this doesn't matter; when present it indicates this .pyi file describes a namespace package.
Requires: Copied from PEP262 (also ambiguous). Optional. Distributions that must be installed for this distribution to work.
Requires-Modules: Optional list of packages/modules required. No need to list modules in the standard library. (Figuring out if this site-packages tree is of the correct python version is of no use for the installdb). Derived from PEP262.
Requires-Namespaces: This package requires a namespace. The value is a list of dotted names of the namespace packages, as they would appear in an import statement.
Directory: A directory from this package. Relative to this .pyi or absolute. Directories inside site-packages *should* be relative; directories outside site-packages *should* be absolute.
File: The value of this is first an optional MD5 hash (or SHA1?) of the file, followed by the path of the installed file (absolute or relative, same rules as for Directory: above). The only restriction this makes on a filename is that you can't have a file in the current directory whose name is also a valid hash and that does not have a hash itself. You can work around this by prepending ./ to the filename, however - but why would you want such a file?
The only issue I can think of right now is with File:. It is not very extensible if a tool wants to keep track of extra info like file permissions. AFAIK RFC822 requires you to keep the order of the fields; if so we could make this:
  File: foo.py
  File-MD5: XXXXXXXXXXXX
  X-MyTool-File-perms: -rw-rw-r-
  File: bar.py
  ...
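To make the ordered File: grouping above concrete, here is a small sketch of how a tool might read it. It assumes, as the message does, that field order is preserved, that every field following a File: line belongs to that file until the next File: line, and that continuation lines are not used; parse_file_records is an illustrative name, not an existing API.

```python
def parse_file_records(text):
    """Group header lines into one dict per File: entry, keeping file order."""
    records = []
    for line in text.splitlines():
        if not line.strip() or ":" not in line:
            continue  # skip blank lines and anything that is not a header
        name, value = line.split(":", 1)
        name, value = name.strip(), value.strip()
        if name == "File":
            records.append({"File": value})  # a new File: starts a new record
        elif records:
            records[-1][name] = value  # attach the field to the latest file
        # Fields before the first File: (Name:, Owner:, ...) are ignored here.
    return records
```

A tool that does not understand X-MyTool-File-perms can simply ignore that key, which is the extensibility the grouped layout is meant to buy.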
Lastly --and I'm not sure how happy I am about this, should have thought of this earlier-- the python packaging tools need to support giving away ownership at install time! Since Debian and Redhat etc just call setup.py, that would mean each package they install would be owned by distutils/setuptools/... That's bad. I propose that setup.py needs to honour an environment variable: PYI_OWNER so that distros can set this to their custom name (dpkg, rpm, ...). Although I can imagine in Debian's case that it's better to change the dh_py* tools to go and modify the .pyi files. So if all distros are happy with having to modify installed files this might not be necessary.
Another nice/required feature for distros would be to ask the tool to only install the namespace package or omit the namespace package. This could just be a command line switch to setup.py I think. Again this is not a hard requirement; I can imagine Debian's dh_py* tools scanning the .pyi files, detecting namespace packages and (re)moving them as required. But once more I don't know enough about other distros.
Phew, thanks for reading this far! I hope this is useful; if it is we should probably start writing the text for the new PEP262 on a wiki somewhere while we discuss details.
Regards
Floris
--
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org
At 10:07 PM 4/5/2008 +0100, Floris Bruynooghe wrote:
This proposal has been here about a week now, with no comments on it. I take that as positive as no one has had major objections. :-)
It's more that there are some holes and handwaving; I haven't really had the mental bandwidth to offer comments on the original proposal as yet. (One comment, though: I really don't like the idea of extending PKG-INFO to include installation data; it's only incidentally related and there are other contexts in which we use PKG-INFO where having that data included would make no sense. Plus, it's really not an ideal file format for including data about a potentially rather large number of files.)
Secondly I'm not sure how useful it is for the version number to be encoded in the filename.
It's very useful for setuptools, as it avoids the need to open and parse the file when searching for a suitable version of a desired package.
It seems the .egg-info file does get installed in the site-packages root currently. This will likely give conflicts when we're starting to use namespace packages.
This doesn't make sense. Namespace packages and project names are not in the same namespace and have nothing to do with each other. For example, I have a project called DecoratorTools that installs a module in the peak.util namespace package. Its egg-info would be something like DecoratorTools-1.6.egg-info. So I think you are confused about something here.
We can't put the .pyi *in* the package since then we lose support for simple modules, so we have to place it *next* to the package.
No, it just goes to the --install-lib directory, which in the default case is site-packages. (But could be a PYTHONPATH or other directory.)
So if "bar" is a namespace package inside "foo" then we would have:
site-packages/foo/bar.pyi
site-packages/foo/bar/__init__.py
Ah, I see... you are definitely confusing package names and project names.
This means any package tool will need to recursively scan the site-packages directory to find the files, but that doesn't seem like too much of a penalty? The alternative is to have a separate directory for the installdb files:
site-packages/foo/bar/__init__.py
site-packages/install.db/foo/bar.pyi
This could significantly reduce the scanning time since there are far fewer files to walk. I chose a name with a "." for install.db so we're not stealing a possible module or package name. Other than that, the name of the directory can be anything we manage to agree on. :-) Using this approach might create confusion about relative paths mentioned in .pyi files though (is the root the current directory or do we pretend the .pyi was actually next to the package/module?).
Distributions not providing a package/module, or with a distribution name different from the package(s)/module(s) provided, would end up in the top level of the database (in both scenarios), effectively stealing package/module names, but that seems to be the current behaviour of distutils already anyway. Namespace sub-distributions (bar in the example above) whose distribution name differs from their package/module name would steal names from their namespace.
All of this is moot, since project/distribution names are unrelated to package names.
Namespace packages are not fully handled yet, ...
AFAIK this should cover namespace packages.
Unfortunately, this doesn't fix the problem, since either *some* package has to own the __init__.py, or there has to be a way for Python to treat the directory as a package without one. And for system package managers (esp. on Linux), some *one* system package must own the file - it can't be owned by multiple system packages. My guess is that this is true, *even if* the file is automatically generated. Some system packaging folks will need to chime in here.
Lastly --and I'm not sure how happy I am about this, should have thought of this earlier-- the python packaging tools need to support giving away ownership at install time! Since Debian and Redhat etc just call setup.py, that would mean each package they install would be owned by distutils/setuptools/... That's bad.
I propose that setup.py needs to honour an environment variable: PYI_OWNER so that distros can set this to their custom name (dpkg, rpm, ...).
A command-line option to 'install' that's inherited by 'install_egg_info' would handle this; I don't think an environment variable is a good idea for this -- too implicit. Note that bdist_rpm, for example, would need to encode this as a command-line option in the .spec file, anyway.
Phew, thanks for reading this far! I hope this is useful; if it is we should probably start writing the text for the new PEP262 on a wiki somewhere while we discuss details.
The major issues at the moment are that 1) your spec is confused about packages vs. projects or distributions (and thus needs to be revamped with that in mind), and 2) PKG-INFO is a really lousy place to put this, from a formatting perspective. It's one thing to include the PKG-INFO in the install DB, and another thing entirely to include the install db into the PKG-INFO! I think PEP 262 had the right idea, even though I'm not overjoyed by its proposed format, either.
On Sat, Apr 05, 2008 at 07:50:19PM -0400, Phillip J. Eby wrote:
At 10:07 PM 4/5/2008 +0100, Floris Bruynooghe wrote: (One comment, though: I really don't like the idea of extending PKG-INFO to include installation data; it's only incidentally related and there are other contexts in which we use PKG-INFO where having that data included would make no sense. Plus, it's really not an ideal file format for including data about a potentially rather large number of files.)
That's fair. Blowing up the files with the PKG-INFO information in could have bad performance effects. rfc822 in the stdlib reads everything in memory AFAIK.
Secondly I'm not sure how useful it is for the version number to be encoded in the filename.
It's very useful for setuptools, as it avoids the need to open and parse the file when searching for a suitable version of a desired package.
Hmm, it's not that much work to read the contents of a .egg-info. Just seems odd to me to have this info in two places so close to each other. [...]
All of this is moot, since project/distribution names are unrelated to package names.
So this means there is a flat namespace for all project names and a nested namespace for modules. When I was saying that project names "steal" names from modules, that is because they end up in the same directory. I.e. project "foo" with foo_1.0.egg-info provides module "bar", while project "bar" with bar_1.0.egg-info provides module "bar2". Not ideal.
What I was trying to get at was to prefix the names of projects that provide a sub-module for a namespace with the namespace module name. Inside the hypothetical installdb, that is. But maybe that just makes the whole project namespace vs module namespace distinction more confusing (thinking about it, definitely a bad idea if the project of the sub-package also installs a script or so).
The second part was introducing a "virtual project" for pure namespace packages, where the project name would have to be the same as the package name in order to find it.
AFAIK this should cover namespace packages.
Unfortunately, this doesn't fix the problem, since either *some* package has to own the __init__.py, or there has to be a way for Python to treat the directory as a package without one. And for system package managers (esp. on Linux), some *one* system package must own the file - it can't be owned by multiple system packages.
With the format I suggested, a package tool could detect on install whether a required pure namespace package was already installed or still needed to be installed/created. Similarly, on removal of a sub-package it is possible to detect whether the pure namespace package is still required (by checking if its directory contains any files other than those provided by the namespace package).
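The removal check described here might look roughly like the following sketch. It assumes the namespace package's own entries are known by name relative to its directory (for instance taken from the File: lines of its side-car file); namespace_is_unused is a hypothetical helper name.

```python
import os


def namespace_is_unused(namespace_dir, owned_names):
    """Return True if namespace_dir contains only the namespace package's own entries.

    owned_names: entry names relative to namespace_dir, e.g. {"__init__.py"}.
    """
    owned = set(owned_names)
    for entry in os.listdir(namespace_dir):
        if entry not in owned:
            return False  # a sub-package or stray file still lives here
    return True
```

This only answers the question for tools that can decide dependencies dynamically; it does not settle which system package would own the file.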
My guess is that this is true, *even if* the file is automatically generated. Some system packaging folks will need to chime in here.
System packagers would create 2 packages out of a package requiring a namespace package. One the pure namespace, the other with the sub-package. Other sub-packages then just need to depend on the pure namespace one.
Lastly --and I'm not sure how happy I am about this, should have thought of this earlier-- the python packaging tools need to support giving away ownership at install time! Since Debian and Redhat etc just call setup.py, that would mean each package they install would be owned by distutils/setuptools/... That's bad.
I propose that setup.py needs to honour an environment variable: PYI_OWNER so that distros can set this to their custom name (dpkg, rpm, ...).
A command-line option to 'install' that's inherited by 'install_egg_info' would handle this; I don't think an environment variable is a good idea for this -- too implicit. Note that bdist_rpm, for example, would need to encode this as a command-line option in the .spec file, anyway.
I picked an environment variable here because then it would be possible to call setup.py identically whether or not it provides this new installdb. Passing a command line option that doesn't exist tends to cause more problems.
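For illustration only, the two mechanisms under discussion could coexist, with an explicit option taking precedence over the environment. In this sketch PYI_OWNER is the variable named above, while the --owner flag, the "distutils" default and the resolve_owner helper are assumptions made for the example, not existing distutils behaviour.

```python
import argparse


def resolve_owner(argv, environ, default="distutils"):
    """Pick the owner string: explicit option first, then PYI_OWNER, then default."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--owner")  # hypothetical option name
    # parse_known_args leaves unrelated arguments (e.g. "install") untouched.
    args, _unknown = parser.parse_known_args(argv)
    return args.owner or environ.get("PYI_OWNER") or default
```

For example, resolve_owner(["install"], {"PYI_OWNER": "dpkg"}) yields "dpkg", while adding "--owner=rpm" to the arguments overrides it.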
Phew, thanks for reading this far! I hope this is useful; if it is we should probably start writing the text for the new PEP262 on a wiki somewhere while we discuss details.
The major issues at the moment are that 1) your spec is confused about packages vs. projects or distributions (and thus needs to be revamped with that in mind),
See clarification above.
and 2) PKG-INFO is a really lousy place to put this, from a formatting perspective. It's one thing to include the PKG-INFO in the install DB, and another thing entirely to include the install db into the PKG-INFO! I think PEP 262 had the right idea, even though I'm not overjoyed by its proposed format, either.
Not wanting to blow up PKG-INFO is laudable, but OTOH separating out the data is dubious, as is replicating data (PKG-INFO data in .egg-info AND the installdb). PKG-INFO was just simple as it's there and tools can use it already.
Maybe we're making it too hard by wanting to cover *every* file installed by python projects? The main reason for this installdb, as I understand it, is so that a package tool can install a sub-project in a namespace package installed by someone else. And similarly that someone else doesn't wipe away the sub-package when it thinks it can remove the namespace package.
Ah, this makes me think of the people that complain on comp.lang.python that Python namespaces are too tightly bound to files and directories... It all makes sense now, we wouldn't even be having this discussion if a package could declare its namespace in the code! ;-)
Regards
Floris
--
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org
At 02:18 AM 4/6/2008 +0100, Floris Bruynooghe wrote:
On Sat, Apr 05, 2008 at 07:50:19PM -0400, Phillip J. Eby wrote:
At 10:07 PM 4/5/2008 +0100, Floris Bruynooghe wrote: (One comment, though: I really don't like the idea of extending PKG-INFO to include installation data; it's only incidentally related and there are other contexts in which we use PKG-INFO where having that data included would make no sense. Plus, it's really not an ideal file format for including data about a potentially rather large number of files.)
That's fair. Blowing up the files with the PKG-INFO information in could have bad performance effects. rfc822 in the stdlib reads everything in memory AFAIK.
Secondly I'm not sure how useful it is for the version number to be encoded in the filename.
It's very useful for setuptools, as it avoids the need to open and parse the file when searching for a suitable version of a desired package.
Hmm, it's not that much work to read the contents of a .egg-info. Just seems odd to me to have this info in two places so close to each other.
It allows pkg_resources to grok the entire contents of a directory using only a single listdir operation -- not an unbounded number of open-and-read operations. Of course, if we're going to *also* have a properly-named .egg-info file, then using just the project name is sufficient for the install db.
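The single-listdir point can be illustrated with a short sketch. This is not pkg_resources' actual code; it assumes file names of the form Name-Version.egg-info, optionally followed by a further "-pyX.Y" part, with any hyphens inside the name or version already escaped so that splitting on "-" is safe.

```python
import os


def scan_egg_info(directory):
    """Return sorted (project, version) pairs from file names alone.

    One os.listdir() call; no metadata file is opened or parsed.
    """
    found = []
    for entry in os.listdir(directory):
        if not entry.endswith(".egg-info"):
            continue
        stem = entry[: -len(".egg-info")]
        parts = stem.split("-")
        if len(parts) >= 2:  # assumed layout: Name-Version[-pyX.Y]
            found.append((parts[0], parts[1]))
    return sorted(found)
```

With the version only inside the file, the same scan would need one open-and-read per entry, which is the cost being objected to.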
[...]
All of this is moot, since project/distribution names are unrelated to package names.
So this means there is a flat namespace for all project names and nested namespace for modules. When I was saying that project names "steal" names from modules that is because they end up in the same directory. I.e. project "foo" with foo_1.0.egg-info provides module "bar", while project "bar" with bar_1.0.egg-info provides module "bar2". Not ideal.
I have no idea what you're saying here. There is absolutely no relationship between project names and the Python package/module namespace. None. Thus, any attempt to talk about them as though they are related is just noise to me.
The second part was introducing a "virtual project" for pure namespace packages, where the project name would have to be the same as the package name in order to find it.
I think there would also need to be some prefix to the name, to prevent confusion in the event that there exists a normal project name that happens to use that package name. (Again: the two namespaces are unrelated, so a new/reserved namespace would be required for these virtual projects.)
AFAIK this should cover namespace packages.
Unfortunately, this doesn't fix the problem, since either *some* package has to own the __init__.py, or there has to be a way for Python to treat the directory as a package without one. And for system package managers (esp. on Linux), some *one* system package must own the file - it can't be owned by multiple system packages.
With the format I suggested, a package tool could detect on install whether a required pure namespace package was already installed or still needed to be installed/created. Similarly, on removal of a sub-package it is possible to detect whether the pure namespace package is still required (by checking if its directory contains any files other than those provided by the namespace package).
Again... some system packaging folks need to speak up on this, because my understanding is that some tools simply can't do something like this. They need to make explicit what a given package depends on, and install that, not dynamically decide what dependencies something has. (And then there is the possibility of a problem if a non-system packager installs the namespace, and then you install a system package for something that includes packages in that namespace.)
Lastly --and I'm not sure how happy I am about this, should have thought of this earlier-- the python packaging tools need to support giving away ownership at install time! Since Debian and Redhat etc just call setup.py, that would mean each package they install would be owned by distutils/setuptools/... That's bad.
I propose that setup.py needs to honour an environment variable: PYI_OWNER so that distros can set this to their custom name (dpkg, rpm, ...).
A command-line option to 'install' that's inherited by 'install_egg_info' would handle this; I don't think an environment variable is a good idea for this -- too implicit. Note that bdist_rpm, for example, would need to encode this as a command-line option in the .spec file, anyway.
I picked an environment variable here because then it would be possible to call setup.py identically whether or not it provides this new installdb. Passing a command line option that doesn't exist tends to cause more problems.
How so, if this is going into a new version of Python?
Not wanting to blow up PKG-INFO is laudable, but OTOH separating out the data is dubious as is replicating data (PKG-INFO data in .egg-info AND the installdb). PKG-INFO was just simple as it's there and tools can use it already.
Maybe we're making it too hard by wanting to cover *every* file installed by python projects? The main reason for this installdb, as I understand it, is so that a package tool can install a sub-project in a namespace package installed by someone else. And similarly that someone else doesn't wipe away the sub-package when it thinks it can remove the namespace package.
It's not just about namespace packages, it's about any package or module. We also want to know about installed scripts, data, etc., so that they can be cleaned up by a tool that does uninstalls.
Ah, this makes me think of the people that complain on comp.lang.python that Python namespaces are too tightly bound to files and directories... It all makes sense now, we wouldn't even be having this discussion if a package could declare its namespace in the code! ;-)
Or if you could import from directories without needing there to be an __init__.py, and Python supported namespace packages by default.
Rather than post my comments in-line, I will summarize what I see as the key points raised by the discussion over the weekend.
1. The strawman proposal did not explicitly mention how python packages (and modules) would be assigned to a distribution, nor make clear the distinction between packages and distributions.
2. The strawman proposal did not explicitly address how optional add-on tools (like setuptools) might manage namespace packages.
3. PKG-INFO possibly makes a poor conduit for the proposed installation metadata, both because its usage in my original proposal confuses packages with distributions and because its file format is perhaps inefficient for the purpose.
4. Concerns were raised about the performance penalty for using the side-car style files without version numbers, possibly not all of which were located at the top-most level of the directory listed in the python path.
I will respond to each of these in turn below.
1. The strawman proposal did not explicitly mention how python packages (and modules) would be assigned to a distribution, nor make clear the distinction between packages and distributions.
The unstated thought was that the side-car file would contain a line like:
  Provided-By: SomeDistribution
that would assign the python package to a distribution. The side-car files would be named like the package, and there would be no standard centralized database of distributions. The reasons for proposing it like this are:
a. I believe that having side-car files that sit alongside packages because they have the same base name makes the database more transparent to the uninitiated. Just browsing a directory of python packages will allow you to see what's going on. Moving like-named files around manually maintains the integrity and availability of the data. I think that having magic entries in an essentially "hidden" directory somewhere will cause all sorts of trouble that could be avoided at the cost of a small bit of duplication.
b. I assume, perhaps incorrectly, that most distributions contain only a single package.
That said, I do agree that if you are primarily interested in a database of *distributions* (as opposed to *packages*) then something like what is proposed in PEP 262 makes more sense (but it would have to be per directory and not site-wide due to the dynamic nature of the python path). This is a trade-off between putting the metadata up front in an obvious and easy to understand way so that it will hopefully have a better chance of being noticed and maintained, versus tucking it away hidden someplace so that even though it is broken, it doesn't bother anyone until they care enough to fix it. *It is this trade-off that I am exploring with this strawman "counter" proposal to PEP 262.*
2. The strawman proposal did not explicitly address how optional add-on tools (like setuptools) might manage namespace packages.
I agree with Floris that the best way to avoid magic is to actually have the sub-packages in a namespace share the same parent directory on disk. Since the goal of my proposal is to create the necessary metadata infrastructure so that add-on tools can be used to manage a standard python installation (i.e. no runtime support), I don't see any other way to support this feature in the proposal. Of course, non-standard features like zipped eggs and such could still be deployed using whatever tools and trickery are necessary to achieve the desired ends. To support this, we could indeed add a flag inside the side-car file indicating that the package is a namespace package and that one would need to recurse into it to see what is installed. Python-based installers could create the namespace directory on the fly by default or optionally when needed, and system packagers could require a namespace system-level package.
3. PKG-INFO possibly makes a poor conduit for the proposed installation metadata, both because its usage in my original proposal confuses packages with distributions and because its file format is perhaps inefficient for the purpose.
Using PKG-INFO was just an attempt to be incremental and make use of what is already there. With the practice of including more than cursory documentation in the Description, perhaps it is too much and should be pared down for this purpose, or thrown out altogether if it really isn't the right thing. I'll address performance in the next point.
4. Concerns were raised about the performance penalty for using the side-car style files without version numbers, possibly not all of which were located at the top-most level of the directory listed in the python path.
Any add-on tool that actually used the data would very likely need to build a cache of the data using a more efficient representation, particularly if the add-on tool had a distribution-oriented view of the installation. The goal is not to support runtime scanning and manipulation of the data for use by add-on tools that work with the python path in non-standard ways, but to put in place a mechanism to merely make the metadata available for those who opt in to the usage of such tools as well as for non-tool users to manually inspect. Once a user opts in to such an add-on tool, they might be expected to use it for all of their installations if they want to avoid rebuilding the database cache etc., but could always resync with what's on disk by explicitly rebuilding the database.
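As a sketch of the kind of cache an add-on tool might build, the function below inverts per-package side-car contents into a distribution-oriented view using the Provided-By: line mentioned under point 1. The input shape (a dict of package name to side-car text) and the name packages_by_distribution are assumptions for the example.

```python
from collections import defaultdict


def packages_by_distribution(sidecars):
    """Invert {package_name: sidecar_text} into {distribution: [package names]}."""
    index = defaultdict(list)
    for package, text in sorted(sidecars.items()):
        for line in text.splitlines():
            name, sep, value = line.partition(":")
            if sep and name.strip() == "Provided-By":
                index[value.strip()].append(package)
                break  # assume one Provided-By: line per side-car file
    return dict(index)
```

Rebuilding this mapping from what is on disk is the explicit "resync" step; the side-car files remain the source of truth.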
On Sat, Apr 05, 2008 at 10:49:24PM -0400, Phillip J. Eby wrote:
At 02:18 AM 4/6/2008 +0100, Floris Bruynooghe wrote:
On Sat, Apr 05, 2008 at 07:50:19PM -0400, Phillip J. Eby wrote:
At 10:07 PM 4/5/2008 +0100, Floris Bruynooghe wrote: (One comment, though: I really don't like the idea of extending PKG-INFO to include installation data; it's only incidentally related and there are other contexts in which we use PKG-INFO where having that data included would make no sense. Plus, it's really not an ideal file format for including data about a potentially rather large number of files.)
That's fair. Blowing up the files that carry the PKG-INFO information could have bad performance effects. rfc822 in the stdlib reads everything into memory AFAIK.
Secondly I'm not sure how useful it is for the version number to be encoded in the filename.
It's very useful for setuptools, as it avoids the need to open and parse the file when searching for a suitable version of a desired package.
Hmm, it's not that much work to read the contents of a .egg-info. Just seems odd to me to have this info in two places so close to each other.
It allows pkg_resources to grok the entire contents of a directory using only a single listdir operation -- not an unbounded number of open-and-read operations.
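As a rough illustration of the point about a single listdir operation: when the version is part of the filename, a directory can be summarised without opening any file. This is a simplified sketch, not how pkg_resources is actually implemented; the `scan_egg_info` name is made up here, and it assumes filenames of the form `Name-Version[-pyX.Y].egg-info` with no hyphen inside the name or version.

```python
import os


def scan_egg_info(directory):
    """Return {project_name: version} from one os.listdir call, with no file reads."""
    versions = {}
    for filename in os.listdir(directory):
        base, ext = os.path.splitext(filename)
        if ext != ".egg-info":
            continue
        # Assumed layout: Name-Version, optionally followed by -pyX.Y
        parts = base.split("-")
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]
    return versions
```

The alternative discussed above, reading each file to learn its version, costs one open-and-read per installed project instead.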
I'm still not thrilled. To quote the "Rejected Suggestions" section of PEP 262: "First, performance is probably not an extremely pressing concern as the database is only used when installing or removing software, a relatively infrequent task." Yet, it's a done fact so there's no point in me complaining about it - I'll live with it.
The second part was introducing a "virtual project" for pure namespace packages, where the project name would have to be the same as the package name in order to find it.
I think there would also need to be some prefix to the name, to prevent confusion in the event that there exists a normal project name that happens to use that package name. (Again: the two namespaces are unrelated, so a new/reserved namespace would be required for these virtual projects.)
Sounds sensible.
AFAIK this should cover namespace packages.
Unfortunately, this doesn't fix the problem, since either *some* package has to own the __init__.py, or there has to be a way for Python to treat the directory as a package without one. And for system package managers (esp. on Linux), some *one* system package must own the file - it can't be owned by multiple system packages.
With the format I suggested, a package tool could detect on install whether a required pure namespace package was already installed or still needed to be installed/created. Similarly, on removal of a sub-package it is possible to detect whether the pure namespace package is still required (by checking if its directory contains any files other than those provided by the namespace package).
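The removal check described here can be sketched in a few lines. The function name and the set of files assumed to belong to the namespace package itself are illustrative assumptions, not part of any existing tool.

```python
import os

# Assumed set of entries that the pure namespace package itself provides.
NAMESPACE_FILES = frozenset(
    ["__init__.py", "__init__.pyc", "__init__.pyo", "__pycache__"]
)


def namespace_still_needed(namespace_dir, owned=NAMESPACE_FILES):
    """True if the directory holds anything besides the namespace package's own files."""
    return any(entry not in owned for entry in os.listdir(namespace_dir))
```

An uninstaller would remove the sub-package first and delete the namespace directory only when this returns False.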
Again... some system packaging folks need to speak up on this, because my understanding is that some tools simply can't do something like this. They need to make explicit what a given package depends on, and install that, not dynamically decide what dependencies something has. (And then there is the possibility of a problem if a non-system packager installs the namespace, and then you install a system package for something that includes packages in that namespace.)
As for dpkg, it will just overwrite an existing __init__.py in the namespace package if it doesn't own it. It won't even tell you it did so (I was surprised at this). However --and I know you don't like this-- this still is no problem. What we are concerned with here is that a user- or sysadmin-owned directory on the sys.path can be managed sanely. dpkg and co will keep out of those; they have /usr/lib to play in, and sysadmins or users should stay out of /usr/lib in their turn.

What is needed to cooperate with system packagers is:

1. Detect existing packages on other directories of sys.path and accept them to satisfy dependencies on the distribution being installed.

2. Find a solution for a namespace package spread out over two directories of sys.path.
Maybe we're making it too hard by wanting to cover *every* file installed by python projects? The main reason for this installdb, as I understand it, is so that a package tool can install a sub-project in a namespace package installed by someone else. And similarly that someone else doesn't wipe away the sub-package when it thinks it can remove the namespace package.
It's not just about namespace packages, it's about any package or module. We also want to know about installed scripts, data, etc., so that they can be cleaned up by a tool that does uninstalls.
No, it's only about namespace packages. Everything else is easy: each tool can keep its own database of installed packages in a suitable location if it wants to do that. If you didn't install a file you don't remove it.
Ah, this makes me think of the people that complain on comp.lang.python that Python namespaces are too tightly bound to files and directories... It all makes sense now, we wouldn't even be having this discussion if a package could declare its namespace in the code! ;-)
Or if you could import from directories without needing there to be an __init__.py, and Python supported namespace packages by default.
Also a good point. I'm sure people can come up with negative side effects of this, but I can't come up with any myself now. So any takers? Is this a possible option to solve the problem? What is the reason for requiring __init__.py?

The longer this discussion goes on the less I like the idea of a full PEP 262 style database (I do admit that at first it seemed like a reasonable idea to me). One issue I've always had with it is that it suddenly stores management data in library directories (it should live in /var). The .egg-info files do already do this, but then they only really provide the sort of information that can be found in .so files of shared libraries, but for python files.

To summarise what I think are the issues:

* Python packaging tools (distutils, setuptools) need to be able to detect packages on all sys.path directories and use them to satisfy dependencies. AIUI this is already done in Python 2.5 with the .egg-info files.

* Python packaging tools need to be able to share namespace packages in a user owned sys.path/site-packages directory. Installation and removal of the __init__.py needs coordination between the different tools. This is what PEP 262 could solve, but it's not necessarily the best or most loved solution.

* Namespace packages need to be able to be spread over multiple sys.path directories so that the system can provide part of it, the sysadmin some more and the user yet another sub-package.

--
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org
At 09:04 PM 4/9/2008 +0100, Floris Bruynooghe wrote:
On Sat, Apr 05, 2008 at 10:49:24PM -0400, Phillip J. Eby wrote:
At 02:18 AM 4/6/2008 +0100, Floris Bruynooghe wrote:
On Sat, Apr 05, 2008 at 07:50:19PM -0400, Phillip J. Eby wrote:
At 10:07 PM 4/5/2008 +0100, Floris Bruynooghe wrote: (One comment, though: I really don't like the idea of extending PKG-INFO to include installation data; it's only incidentally related and there are other contexts in which we use PKG-INFO where having that data included would make no sense. Plus, it's really not an ideal file format for including data about a potentially rather large number of files.)
That's fair. Blowing up the files that carry the PKG-INFO information could have bad performance effects. rfc822 in the stdlib reads everything into memory AFAIK.
Secondly I'm not sure how useful it is for the version number to be encoded in the filename.
It's very useful for setuptools, as it avoids the need to open and parse the file when searching for a suitable version of a desired package.
Hmm, it's not that much work to read the contents of a .egg-info. Just seems odd to me to have this info in two places so close to each other.
It allows pkg_resources to grok the entire contents of a directory using only a single listdir operation -- not an unbounded number of open-and-read operations.
I'm still not thrilled. To quote the "Rejected Suggestions" section of PEP 262: "First, performance is probably not an extremely pressing concern as the database is only used when installing or removing software, a relatively infrequent task."
Yet, it's a done fact so there's no point in me complaining about it - I'll live with it.
You're conflating .egg-info and PEP 262 -- there is no connection between the two, except the similarity of using a single file per installed distribution to implement a database of sorts.
What is needed to cooperate with system packagers is:
1. Detect existing packages on other directories of sys.path and accept them to satisfy dependencies on the distribution being installed.
This part is already handled by .egg-info.
2. Find a solution for a namespace package spread out over two directories of sys.path.
That part's easy - pkg_resources will already do that. It's the handling of namespace packages when they're being installed by a system packager where things get dicey. However, at least the existing .pth-based solution used by setuptools will work for a setuptools-based package. And setuptools could use a setuptools-based solution elsewhere (i.e., handle the overlapping packages' use of __init__.py). The only place where a problem could come in is if you install other namespace-packaged things to the same directory as your system package manager... but I suppose we could just say, "don't do that." :)
No, it's only about namespace packages. Everything else is easy: each tool can keep its own database of installed packages in a suitable location if it wants to do that. If you didn't install a file you don't remove it.
Well, if that were true, then we could handle namespace packages in the same way. :) However, I would like setuptools and distutils at least, to use the same format or a compatible format.
The longer this discussion goes on the less I like the idea of a full PEP 262 style database (I do admit that at first it seemed like a reasonable idea to me). One issue I've always had with it is that it suddenly stores management data in library directories (it should live in /var).
Keep in mind that there are platforms and use cases where the FHS makes no sense to start with. FHS is for systems, not applications, for example. Does Firefox split its user profile directories across lib, var, and etc? After all, they contain code, data, and configuration.
To summarise what I think are the issues:
* Python packaging tools (distutils, setuptools) need to be able to detect packages on all sys.path directories and use them to satisfy dependencies. AIUI this is already done in Python 2.5 with the .egg-info files.
Yep.
* Python packaging tools need to be able to share namespace packages in a user owned sys.path/site-packages directory. Installation and removal of the __init__.py needs coordination between the different tools. This is what PEP 262 could solve, but it's not necessarily the best or most loved solution.
Right; this really does seem to be the main issue. Setuptools solves it for "site" directories (e.g. site-packages) using .pth files, but it is not an ideal solution. It also won't work for non-site directories, which means I'd have to keep the site.py hack for PYTHONPATH dirs. Not having a uniform way to address it is also an implementation issue, since setuptools will need to know which way it's solving the problem. I suppose easy_install could detect the presence of other nspkg.pth files, and choose to use that method in that event. But I'd much rather get rid of the nspkg.pth files, as they are second only to the easy_install site.py hack in their nastiness.
* Namespace packages need to be able to be spread over multiple sys.path directories so that the system can provide part of it, the sysadmin some more and the user yet another sub-package.
This part is already solved by pkg_resources, or for that matter, by pkgutil. (pkgutil is only suitable if you're not using eggs, though.)
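For readers unfamiliar with the pkgutil route mentioned here, the following is a small self-contained demonstration using `pkgutil.extend_path`, the documented stdlib idiom for letting one package span several sys.path directories. The directory layout, the `demo_nspkg` package name, and the helper names are invented for the demo; the two temporary directories stand in for a system-provided and a user-provided portion.

```python
import importlib
import os
import sys
import tempfile

# Each portion ships the same __init__.py, which widens the package's __path__
# to include every same-named directory found along sys.path.
INIT = "from pkgutil import extend_path\n__path__ = extend_path(__path__, __name__)\n"


def make_portion(root, namespace, module, value):
    """Create root/namespace/ containing __init__.py and one module."""
    pkg_dir = os.path.join(root, namespace)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
        fh.write(INIT)
    with open(os.path.join(pkg_dir, module + ".py"), "w") as fh:
        fh.write("VALUE = %r\n" % value)


def demo():
    system_dir = tempfile.mkdtemp()  # stands in for a system-owned directory
    user_dir = tempfile.mkdtemp()    # stands in for a user-owned directory
    make_portion(system_dir, "demo_nspkg", "system_part", "system")
    make_portion(user_dir, "demo_nspkg", "user_part", "user")
    sys.path[:0] = [system_dir, user_dir]
    importlib.invalidate_caches()
    a = importlib.import_module("demo_nspkg.system_part")
    b = importlib.import_module("demo_nspkg.user_part")
    return a.VALUE, b.VALUE
```

Note that this only covers the import side. It does not settle which installer or system package owns each copy of __init__.py, which is the coordination problem discussed above.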
participants (3): Alexander Michael, Floris Bruynooghe, Phillip J. Eby