PEP 376 and PEP 302 - allowing import hooks to provide distribution metadata

This post basically follows on from the previous thread about PEP 376 (Changing the .egg-info structure) where it was pointed out that the PEP doesn't cover PEP 302 import hooks (except in the explicit special case of zip files). This is likely to be a fairly long posting, but I'd like to try and cover the subtleties a little, so that people can comment without having to refer back to too many documents. Comments are most definitely welcome! While I know PEP 302 reasonably well, and the zip importer implementation, I'm not an expert on the egg-info structure, or the practical drivers behind it, so if I miss key issues because of a too-theoretical approach, I'd be grateful for corrections. The basic structure of PEP 302 imports is as follows: - scan sys.meta_path checking each finder in turn - if nothing found, scan sys.path and for each entry pass it to each element of sys.path_hooks, to get a finder to check - do the default filesystem processing on path items for which no path hook applies The first finder (if any) to recognise the module name wins (and returns a loader responsible for creating the actual module object). So it's the finders that are responsible for scanning the "filesystem" (more accurately, the namespace that finder manages). PEP 302 is entirely couched in terms of modules, packages and module names. There is no concept of a "distribution" in PEP 376 terms. This is entirely understandable, as imports don't know about distributions (see the docutils example in PEP 376 - the docutils *distribution* contains the *module* roman, but there's nothing about importing roman that requires you to know that it came from the docutils distribution). So we need to introduce a new concept, that of a distribution, into PEP 302 loaders. And the place it should be located is in the finder (which handles the "filesystem" aspects of the protocol. So we have the methods: Finder.list_distributions() Returns a list of all distribution names in the "filesystem" managed by the finder (usually one zip file, path directory, salite database, or whatever) Finder.get_metadata(distribution_name) Returns a metadata object for the given distribution (this is the PEP 376 Distribution object). One note here - get_egginfo_file should be specified as returning a *file-like object* rather than a file instance. Another note - get_egginfo_file and get_egginfo_files could just as easily be named get_metadata_file and get_metadata_files - just as meaningful, and less tied to the egg format (as well as avoiding the "egg" name, which I personally dislike :-)) Finder.uninstall(distribution_name, filter=callable, installer=name) Uninstall the given distribution. It's likely that many finders will be read-only. In that case, this function should return None. Otherwise, return a list of the "files" removed. (This may need some clarification, as many finders won't have a concept of a "file name"). I don't think anything else is needed to support PEP 376. The prototype implementation of PEP 376 could be extended to work with finders (doing the relevant sys.meta_path and sys.path_hooks searches). For the final implementation, the special casing of zip files would be replaced by an implementation of the extended finder protocol in the zipimporter object. There's almost certainly aspects missing from the above proposal. But it does have some definite advantages, above and beyond simply allowing PEP 302 importers to participate in the PEP 376 protocol. Setuptools-style eggs could be handled simply by creating a specialised finder (IIUC, they currently just use the standard zipimporter - the specialised version could subclass this) to override the metadata methods so as to cater for their specialised egg-info format. Other formats could be handled similarly. Does this sound sensible? Tarek, would you be OK with me attempting to modify your prototype to support this protocol? Are there any tests for PEP 376, so that I can confirm I haven't completely broken something? If I can, I'll knock up some simple prototype importers for non-standard examples, and see how they work with all this. Paul.

2009/7/3 Paul Moore <p.f.moore@gmail.com>:
Yes that's exactly what I was thinking about after the discussion we had in the other thread. This change would allow much more flexibility. Pleas go ahead, it's located here : http://bitbucket.org/tarek/pep376/ You can give me a bitbucket account so I can give you write access to the repo, There are tests as long as you install Nose. Tarek

Tarek Ziadé wrote:
It's also very similar to the ideas that started to tick over in the back of my head after the previous discussion (only far more fleshed out that anything I got to - fortunately I read Paul's message before spending too much time on it!) Specifically regarding this comment from Paul:
PEP 302 finders and loaders actually *are* expected to have a concept of a "file name". For Python modules they they are expected to have it so they can populate __file__ correctly (runpy is also a lot happier with them when they expose get_filename(), a relatively recent addition to the PEP 302 API). For other resources they're expected to have it so the get_data() API can work correctly. One other thing to consider: it may be desirable to have hooks for meta_path and path_hooks in distutils that are checked *before* the full import hooks in the sys module. The reason I say this is that it would allow someone to override just the metadata retrieval (e.g. to pick up the extra dependency information saved by setuptools and friends) without having to worry about breaking the actual imports from those locations. Cheers, Nick. P.S. +lots on using 'metadata' in the PEP 376 method names rather than the jargon 'egginfo'. Jargon isn't always bad, but using it seems fairly gratuitous in this case. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, Jul 3, 2009 at 17:47, Nick Coghlan <ncoghlan@gmail.com> wrote:
Not only that, lots of code out there expects values from __path__ to contain os.path.sep on top of __file__, so the idea of file paths is already hard-coded into how import works, whether we like it or not.
An addition I was not even aware of. Looks like importlib needs a little updating.
So as the guy who plans to try to clean up import, I just want to say that implicit finders will eventually go away. Do not start off making them implicit, but you can have them installed on sys.meta_path or sys.path_hooks by default.
Ditto from here. Plus I have an aversion to terminology that goes down the reptile route instead of the Monty Python route. -Brett

2009/7/4 Brett Cannon <brett@python.org>:
Yes, consider me duly reprimanded :-) They aren't *file* names, but they still have to look similar enough to satisfy code that uses things like filename manipulation functions on them.
Yes, I didn't notice that sneak into PEP 302, either :-)
I'm against having distutils-specific hooks. I don't see a genuine need, and it adds complexity so let's wait for a proper use case before adding that. With some suitably ugly hacking, you could probably add a finder to sys.meta_path which never actually found a module, but which reported one or more distributions for the distutils metadata APIs. Playing around with that idea could likely give you any flexibility you needed (at the hopefully minor cost of an extra dummy meta_path call on each import). As far as the currently implicit filesystem path handling is concerned, I'm aware that the intention is for importlib to move these into an explicit importer at some point. For my prototype changes to Tarek's prototype, I'm creating a dummy filesystem finder with just the new methods needed for the distribution metadata. It will be easy to move these into a real filesystem finder when one emerges. Paul.

Paul Moore wrote:
I probably should have been noisier about that when I added it. I'm pretty sure it did come up on this list, but it would have been somewhere in the middle of a runpy discussion: runpy was stuck because it only uses PEP 302 to *find* the modules it needs, but not to actually load them. In the original version of PEP 302 the only way to get a loader to tell you the filename was to load the module and see what it set __file__ to, which wasn't useful in the runpy case. If I recall correctly, at the time when PJE was rationalising the code duplication between runpy and pkglib the comment was also made that runpy's get_filename() optional loader extension should be mentioned in PEP 302. runpy still works for loaders that don't provide it, it just can't set __file__ or sys.argv[0] correctly in those cases. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

I'm not complaining, just saying I didn't notice. I essentially did the same with get_source/get_bytecode so as to be able to set __file__. When I get around to it I will create a RunpyLoader in importlib.abc and add the method in the proper machinery loaders. On Jul 4, 2009 3:43 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote: Paul Moore wrote: > 2009/7/4 Brett Cannon <brett@python.org>: pretty sure it did come up on this list, but it would have been somewhere in the middle of a runpy discussion: runpy was stuck because it only uses PEP 302 to *find* the modules it needs, but not to actually load them. In the original version of PEP 302 the only way to get a loader to tell you the filename was to load the module and see what it set __file__ to, which wasn't useful in the runpy case. If I recall correctly, at the time when PJE was rationalising the code duplication between runpy and pkglib the comment was also made that runpy's get_filename() optional loader extension should be mentioned in PEP 302. runpy still works for loaders that don't provide it, it just can't set __file__ or sys.argv[0] correctly in those cases. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

2009/7/4 Brett Cannon <brett@python.org>:
If it turns out that we use PEP 302-like loaders, I am also suggesting that the default metadata directory name used in Distutils is changed to "DIST_NAME.metadata". The loader would still work with "DIST_NAME.egg-info" directories for compatibility with existing format in the query APIs, but the Distutils install command would rather create "DIST_NAME.metadata"

2009/7/3 Tarek Ziadé <ziade.tarek@gmail.com>:
You can give me a bitbucket account so I can give you write access to the repo, There are tests as long as you install Nose.
How do I get the tests to work? Just running nosetests gives an error (probably because pkgutil is being imported from the stdlib, rather than from this directory). If I set PYTHONPATH=. then I get errors. I suspect path normalisation (for backslashes) in the zipfile handling. Yes, it appears to be caused by the fact that Distribution uses os.path.join to construct self.pkg_info_path (and self.record_path) and yet zipfiles expect '/' separated pathnames. Paul. Error log:
====================================================================== ERROR: test_pkgutil.test_zipped_directory ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 183, in runTest self.test(*self.arg) File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 80, in test_zipped_directory dir = ZippedDistributionDir(SITE_PKG+'.zip') File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 324, in __init__ self.add(ZippedDistribution(self._zip_file, paths[0])) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 259, in __init__ super(ZippedDistribution, self).__init__(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 154, in __init__ pkginfo = self._open_pkginfo() File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 264, in _open_pkginfo return self.zipfile.open(self.pkg_info_path) File "C:\Apps\Python26\lib\zipfile.py", line 859, in open zinfo = self.getinfo(name) File "C:\Apps\Python26\lib\zipfile.py", line 826, in getinfo 'There is no item named %r in the archive' % name) KeyError: "There is no item named 'mercurial-1.0.1.egg-info\\\\PKG-INFO' in the archive" ====================================================================== ERROR: test_pkgutil.test_zipped_distribution ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 253, in setUp try_run(self.test, names) File "C:\Apps\Python26\Lib\site-packages\nose\util.py", line 487, in try_run return func() File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 90, in setup_zip pkgutil._dist_dirs.load(SITE_PKG+'.zip') File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 380, in load dist_dir = ZippedDistributionDir(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 324, in __init__ self.add(ZippedDistribution(self._zip_file, paths[0])) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 259, in __init__ super(ZippedDistribution, self).__init__(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 154, in __init__ pkginfo = self._open_pkginfo() File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 264, in _open_pkginfo return self.zipfile.open(self.pkg_info_path) File "C:\Apps\Python26\lib\zipfile.py", line 859, in open zinfo = self.getinfo(name) File "C:\Apps\Python26\lib\zipfile.py", line 826, in getinfo 'There is no item named %r in the archive' % name) KeyError: "There is no item named 'mercurial-1.0.1.egg-info\\\\PKG-INFO' in the archive" ====================================================================== FAIL: test_pkgutil.test_distribution ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 183, in runTest self.test(*self.arg) File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 60, in test_distribution os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/RECORD')]) AssertionError: ['C:\\Users\\Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg-info\\PKG_INFO', 'C:\\Users\\ Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg-info\\RECORD'] != ['C:\\Users\\Gustav\\Data\\pep376\\site- packages\\mercurial-1.0.1.egg-info/PKG_INFO', 'C:\\Users\\Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg- info/RECORD'] ---------------------------------------------------------------------- Ran 7 tests in 0.019s FAILED (errors=2, failures=1) 13:40 C:\Users\Gustav\Data\pep376
2531 0 0 19/06/09 17:14 mercurial/filelog.py 3449 0 0 19/06/09 17:14 mercurial/filelog.pyc 234 0 0 19/06/09 17:14 mercurial-1.0.1.egg-info/PKG-INFO 162 0 0 19/06/09 17:14 mercurial-1.0.1.egg-info/RECORD 0 0 0 19/06/09 17:14 processing/__init__.py 3482 0 0 19/06/09 17:14 processing-0.52.egg-info/PKG-INFO 165 0 0 19/06/09 17:14 processing-0.52.egg-info/RECORD -------- ----- ----- ------- 10023 0 0 7 files Paul.

2009/7/4 Paul Moore <p.f.moore@gmail.com>:
Actually, the test assert_equals(list(dist.get_egginfo_files(local=True)), [os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/PKG_INFO'), os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/RECORD')]) is broken, because the expected value uses slashes, which are *not* the local separator on win32. I've attached a patch. But there's 2 comments I'd make (one minor, one major) Minor one: The tests often seem to be exercising the internal classes, not so much the public API, so many of them will probably not be of much use to me :-( Major one: I'm not entirely sure what the purpose of the "local" flag is. With local=True, you are supposed to get a filesystem path in the local form. But that makes no sense for non-filesystem based loaders (e.g. zipfiles). So how is a user expected to use the value? With local=false, you get slash-separated files (which is good, because it's consistent) which are either absolute (see my earlier comments about how broken this is in practice) or relative to the "directory" containing the egg-info file (where this is of course not necessarily a real filesystem location). But these aren't useful as filesystem names, because they aren't local format. So what is the point of the get_installed_files, uses, and get_egginfo_files functions? Who will use the returned values, and what for? Where will a user get a value to pass to uses()? Given that get_egginfo_file takes names relative to the egginfo dir, shouldn't get_egginfo_files *return* values relative to the same location, so that they can be used there? I guess the same type of argument applies with get_inetalled_files() and uses() - the values returned by get_installed_files can be seen as a set of opaque tokens (OK, realistically they will be "filenames" in some sense) which can be passed to uses() - but if they aren't absolute filesystem paths, you can't compare them. Example: distribution a contains file p.py distribution b contains file p.py pkgutil.get_distribution(a).get_installed_files contains 'p.py' pkgutil.get_distribution(b).uses('p.py') is True What does this mean? If the 2 distributions are in the same sys.path element, there's a clash. If not, is there a clash? What if they are in 2 different directories? What if they are in the same directory, but it's repeated under different names (e.g. symlinks) on sys.path? What if one is in a directory and the other in a zip file? This isn't even related to non-filesystem loaders. It's a fundamental issue with the whole API. What you seem to want is actually a canonical "name" for a file object - for filesystem files, os.path.normcase(os.path.normpath(filename)) is probably good enough, although it doesn't deal with symlinks. For other types of loader, you have to rely on the loader itself - loader.get_fullname() is probably OK. But even then, it's not clear how user code would actually *use* these APIs. I think you need some real-world use cases, with actual sample (pseudo-)code, to validate the design here. As things stand, it's both confusing and (I suspect) unusable in practice. Sorry, I know that sounds negative, but if this isn't to be a source of subtle bugs for years to come, it needs to be clarified now. PEP 302 is still hitting this type of issue - runpy and importlib have brought out errors and holes in the protocol quite recently - even though Just and I went to great lengths to try to tease out hidden assumptions up front. Concrete proposal: get_metadata_files() - returns slash-separated names, relative to the egginfo dir get_metadata_file(path) - path must be slash-separated, relative to the egginfo dir get_installed_files - returns the contents of RECORD unaltered uses(path) - checks if path is in RECORD The latter 2 are not very useful in practice - you can't say anything about entries in different RECORD files, which is likely the real use case you want. Maybe RECORD could have an extra "Location" entry, which determines where it exists globally (this would be the directory to which the filenames were relative, in the case of filesystem-based distributions) and RECORD entries are comparable if the Location values in the 2 RECORD files match. That's a lot more complex - but depending on what use people expect to make of these 2 APIs, it may be justified. Paul.

2009/7/4 Paul Moore <p.f.moore@gmail.com>:
I just run them from within the directory
If I set PYTHONPATH=. then I get errors. I suspect path normalisation (for backslashes) in the zipfile handling.
Applied, thanks (I didn't run them under win32 yet)
I'll add some more tests then, or even user stories.
Agreed, the zip case was added afterwards, but in practice, the APIs are still dealing with the files are *filesystem files* located in a container (eg a directory or a zip file) located somewhere on the filesystem. "local" in that case is a flag that means "translate a file path expressed in the local filesystem" which make no sense anymore with zip files. But the goal really, is to be able to point out that two distributions are using the very same file. Right now PEP 376 and the prototype code handle these two real world use cases: - browsing regular site-packages-like directories - browsing site-packages-like directories, that are zipped. For example: - I have a "packages.zip" file in /var/, wich is also in my sys.path. It contains a distribution "foo-1.0" that has the "roman.py" file in its root. So the RECORD file located in "foo-1.0.egg-info" has a line starting with "roman.py,..." - Then if I install docutils 0.5 as a regular filesystem distribution, "roman.py" will be added in Python's site-packages. and docutils-0.5.egg-info/RECORD will contain "roman.py,..." with the same hash. The local flag will return these paths: - /var/packages.zip/roman.py <--- not a "real" path - /usr/local/lib/python2.6/site-packages/roman.py So removing the docutils distribution will be doable, because these paths are different.
Yes, In practice, if you look at my previous example, even if "/var/packages.zip/roman.py" isn't a real path, it's enough to compare RECORD entries globally. The "Location" entry you are proposing in that case, would be "/var/packages.zip". But do we really need to store it the RECORD ? Or can't we define an API that returns two elements : - the path to the location (in the example: /var/packages.zip or /usr/local/lib/python2.6/site-packages) - the path within the location itself (in the example: roman.py) A concrete proposal would be to take back your proposal, but return tuples with the location as the first member. e.g. "(location, relative path[s])" The code that is comparing paths to see if they are the same can join location+relative path[s], while we can provide in a dedicated function something to read the content of the file (that would be get_data I guess, if I refer to PEP 302) Tarek

2009/7/5 Tarek Ziadé <ziade.tarek@gmail.com>:
That sounds reasonable. So we can forget the "local" parameter, and return a tuple: - absolute location of the container (directory, zipfile or whatever containing the egginfo file) as a filesystem path in canonical native form (where it's filesystem based) or as an opaque token for the odd cases (frozen modules, for example) where a filesystem location isn't available. - entry from the RECORD file, as a slash-separated filename relative to the root of the container.
Unfortunately, get_data loads data files located within a *package*, using a name relative to the package directory. You can't get at the metadata of a *distribution* like that. But if you're using get_installed_files(), why would you then want to read the files? What exactly would you *use* get_installed_files for which would then leave you needing to read the files? If it's to check they haven't changed (by comparing md5 values) you're doing that to uninstall, so that's the responsibility of the uninstall function. Again, it's a question of what is a public API, and what is the use case it's designed for. I'm currently writing a SQLite importer, which will allow me to store "files" in any sort of database tables I want, so I can build in some nice pathological behaviour. That should tease out some awkward corner cases :-) Paul

On Sun, Jul 5, 2009 at 3:27 PM, Paul Moore<p.f.moore@gmail.com> wrote:
exactly,
Right. These APIs were created for third-party package managers. One use case of a package manager is the uninstallation, but I have no other use case in mind.
Sounds good. Semi-related: even if the files themselves are in the filesystem, having a sqlite db to index the list of installed distributions makes a good cache solution to reduce the disk I/O and speed up the query functions. So maybe we could use a disk-based cache for site-packages-like directories in the form of a sqlite db. That's what I am experimenting on my side.
Paul
-- Tarek Ziadé | http://ziade.org

2009/7/3 Tarek Ziadé <ziade.tarek@gmail.com>:
One important note - I plan on using the fact that DistributionDirMap is not public, and hacking it about drastically, or possibly even removing it. (After all, the likes of the load method don't make sense when you've got sys.meta_path, sys.path_importer_cache and the like to consider). If the loss of any of the "internal" classes is an issue, say so now! Paul.

On Sat, Jul 4, 2009 at 3:04 PM, Paul Moore<p.f.moore@gmail.com> wrote:
Go for it please, and let me know if you set a bitbucket account, so you can push your commits in there directly
Paul.
-- Tarek Ziadé | http://ziade.org

2009/7/5 Tarek Ziadé <ziade.tarek@gmail.com>:
Go for it please, and let me know if you set a bitbucket account, so you can push your commits in there directly
My bitbucket account is 'pmoore'. Paul.

2009/7/3 Paul Moore <p.f.moore@gmail.com>:
Yes that's exactly what I was thinking about after the discussion we had in the other thread. This change would allow much more flexibility. Pleas go ahead, it's located here : http://bitbucket.org/tarek/pep376/ You can give me a bitbucket account so I can give you write access to the repo, There are tests as long as you install Nose. Tarek

Tarek Ziadé wrote:
It's also very similar to the ideas that started to tick over in the back of my head after the previous discussion (only far more fleshed out that anything I got to - fortunately I read Paul's message before spending too much time on it!) Specifically regarding this comment from Paul:
PEP 302 finders and loaders actually *are* expected to have a concept of a "file name". For Python modules they they are expected to have it so they can populate __file__ correctly (runpy is also a lot happier with them when they expose get_filename(), a relatively recent addition to the PEP 302 API). For other resources they're expected to have it so the get_data() API can work correctly. One other thing to consider: it may be desirable to have hooks for meta_path and path_hooks in distutils that are checked *before* the full import hooks in the sys module. The reason I say this is that it would allow someone to override just the metadata retrieval (e.g. to pick up the extra dependency information saved by setuptools and friends) without having to worry about breaking the actual imports from those locations. Cheers, Nick. P.S. +lots on using 'metadata' in the PEP 376 method names rather than the jargon 'egginfo'. Jargon isn't always bad, but using it seems fairly gratuitous in this case. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Fri, Jul 3, 2009 at 17:47, Nick Coghlan <ncoghlan@gmail.com> wrote:
Not only that, lots of code out there expects values from __path__ to contain os.path.sep on top of __file__, so the idea of file paths is already hard-coded into how import works, whether we like it or not.
An addition I was not even aware of. Looks like importlib needs a little updating.
So as the guy who plans to try to clean up import, I just want to say that implicit finders will eventually go away. Do not start off making them implicit, but you can have them installed on sys.meta_path or sys.path_hooks by default.
Ditto from here. Plus I have an aversion to terminology that goes down the reptile route instead of the Monty Python route. -Brett

2009/7/4 Brett Cannon <brett@python.org>:
Yes, consider me duly reprimanded :-) They aren't *file* names, but they still have to look similar enough to satisfy code that uses things like filename manipulation functions on them.
Yes, I didn't notice that sneak into PEP 302, either :-)
I'm against having distutils-specific hooks. I don't see a genuine need, and it adds complexity so let's wait for a proper use case before adding that. With some suitably ugly hacking, you could probably add a finder to sys.meta_path which never actually found a module, but which reported one or more distributions for the distutils metadata APIs. Playing around with that idea could likely give you any flexibility you needed (at the hopefully minor cost of an extra dummy meta_path call on each import). As far as the currently implicit filesystem path handling is concerned, I'm aware that the intention is for importlib to move these into an explicit importer at some point. For my prototype changes to Tarek's prototype, I'm creating a dummy filesystem finder with just the new methods needed for the distribution metadata. It will be easy to move these into a real filesystem finder when one emerges. Paul.

Paul Moore wrote:
I probably should have been noisier about that when I added it. I'm pretty sure it did come up on this list, but it would have been somewhere in the middle of a runpy discussion: runpy was stuck because it only uses PEP 302 to *find* the modules it needs, but not to actually load them. In the original version of PEP 302 the only way to get a loader to tell you the filename was to load the module and see what it set __file__ to, which wasn't useful in the runpy case. If I recall correctly, at the time when PJE was rationalising the code duplication between runpy and pkglib the comment was also made that runpy's get_filename() optional loader extension should be mentioned in PEP 302. runpy still works for loaders that don't provide it, it just can't set __file__ or sys.argv[0] correctly in those cases. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

I'm not complaining, just saying I didn't notice. I essentially did the same with get_source/get_bytecode so as to be able to set __file__. When I get around to it I will create a RunpyLoader in importlib.abc and add the method in the proper machinery loaders. On Jul 4, 2009 3:43 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote: Paul Moore wrote: > 2009/7/4 Brett Cannon <brett@python.org>: pretty sure it did come up on this list, but it would have been somewhere in the middle of a runpy discussion: runpy was stuck because it only uses PEP 302 to *find* the modules it needs, but not to actually load them. In the original version of PEP 302 the only way to get a loader to tell you the filename was to load the module and see what it set __file__ to, which wasn't useful in the runpy case. If I recall correctly, at the time when PJE was rationalising the code duplication between runpy and pkglib the comment was also made that runpy's get_filename() optional loader extension should be mentioned in PEP 302. runpy still works for loaders that don't provide it, it just can't set __file__ or sys.argv[0] correctly in those cases. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

2009/7/4 Brett Cannon <brett@python.org>:
If it turns out that we use PEP 302-like loaders, I am also suggesting that the default metadata directory name used in Distutils is changed to "DIST_NAME.metadata". The loader would still work with "DIST_NAME.egg-info" directories for compatibility with existing format in the query APIs, but the Distutils install command would rather create "DIST_NAME.metadata"

2009/7/3 Tarek Ziadé <ziade.tarek@gmail.com>:
You can give me a bitbucket account so I can give you write access to the repo, There are tests as long as you install Nose.
How do I get the tests to work? Just running nosetests gives an error (probably because pkgutil is being imported from the stdlib, rather than from this directory). If I set PYTHONPATH=. then I get errors. I suspect path normalisation (for backslashes) in the zipfile handling. Yes, it appears to be caused by the fact that Distribution uses os.path.join to construct self.pkg_info_path (and self.record_path) and yet zipfiles expect '/' separated pathnames. Paul. Error log:
====================================================================== ERROR: test_pkgutil.test_zipped_directory ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 183, in runTest self.test(*self.arg) File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 80, in test_zipped_directory dir = ZippedDistributionDir(SITE_PKG+'.zip') File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 324, in __init__ self.add(ZippedDistribution(self._zip_file, paths[0])) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 259, in __init__ super(ZippedDistribution, self).__init__(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 154, in __init__ pkginfo = self._open_pkginfo() File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 264, in _open_pkginfo return self.zipfile.open(self.pkg_info_path) File "C:\Apps\Python26\lib\zipfile.py", line 859, in open zinfo = self.getinfo(name) File "C:\Apps\Python26\lib\zipfile.py", line 826, in getinfo 'There is no item named %r in the archive' % name) KeyError: "There is no item named 'mercurial-1.0.1.egg-info\\\\PKG-INFO' in the archive" ====================================================================== ERROR: test_pkgutil.test_zipped_distribution ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 253, in setUp try_run(self.test, names) File "C:\Apps\Python26\Lib\site-packages\nose\util.py", line 487, in try_run return func() File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 90, in setup_zip pkgutil._dist_dirs.load(SITE_PKG+'.zip') File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 380, in load dist_dir = ZippedDistributionDir(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 324, in __init__ self.add(ZippedDistribution(self._zip_file, paths[0])) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 259, in __init__ super(ZippedDistribution, self).__init__(path) File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 154, in __init__ pkginfo = self._open_pkginfo() File "C:\Users\Gustav\Data\pep376\pkgutil.py", line 264, in _open_pkginfo return self.zipfile.open(self.pkg_info_path) File "C:\Apps\Python26\lib\zipfile.py", line 859, in open zinfo = self.getinfo(name) File "C:\Apps\Python26\lib\zipfile.py", line 826, in getinfo 'There is no item named %r in the archive' % name) KeyError: "There is no item named 'mercurial-1.0.1.egg-info\\\\PKG-INFO' in the archive" ====================================================================== FAIL: test_pkgutil.test_distribution ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Apps\Python26\Lib\site-packages\nose\case.py", line 183, in runTest self.test(*self.arg) File "C:\Users\Gustav\Data\pep376\test_pkgutil.py", line 60, in test_distribution os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/RECORD')]) AssertionError: ['C:\\Users\\Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg-info\\PKG_INFO', 'C:\\Users\\ Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg-info\\RECORD'] != ['C:\\Users\\Gustav\\Data\\pep376\\site- packages\\mercurial-1.0.1.egg-info/PKG_INFO', 'C:\\Users\\Gustav\\Data\\pep376\\site-packages\\mercurial-1.0.1.egg- info/RECORD'] ---------------------------------------------------------------------- Ran 7 tests in 0.019s FAILED (errors=2, failures=1) 13:40 C:\Users\Gustav\Data\pep376
2531 0 0 19/06/09 17:14 mercurial/filelog.py 3449 0 0 19/06/09 17:14 mercurial/filelog.pyc 234 0 0 19/06/09 17:14 mercurial-1.0.1.egg-info/PKG-INFO 162 0 0 19/06/09 17:14 mercurial-1.0.1.egg-info/RECORD 0 0 0 19/06/09 17:14 processing/__init__.py 3482 0 0 19/06/09 17:14 processing-0.52.egg-info/PKG-INFO 165 0 0 19/06/09 17:14 processing-0.52.egg-info/RECORD -------- ----- ----- ------- 10023 0 0 7 files Paul.

2009/7/4 Paul Moore <p.f.moore@gmail.com>:
Actually, the test assert_equals(list(dist.get_egginfo_files(local=True)), [os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/PKG_INFO'), os.path.join(SITE_PKG, 'mercurial-1.0.1.egg-info/RECORD')]) is broken, because the expected value uses slashes, which are *not* the local separator on win32. I've attached a patch. But there's 2 comments I'd make (one minor, one major) Minor one: The tests often seem to be exercising the internal classes, not so much the public API, so many of them will probably not be of much use to me :-( Major one: I'm not entirely sure what the purpose of the "local" flag is. With local=True, you are supposed to get a filesystem path in the local form. But that makes no sense for non-filesystem based loaders (e.g. zipfiles). So how is a user expected to use the value? With local=false, you get slash-separated files (which is good, because it's consistent) which are either absolute (see my earlier comments about how broken this is in practice) or relative to the "directory" containing the egg-info file (where this is of course not necessarily a real filesystem location). But these aren't useful as filesystem names, because they aren't local format. So what is the point of the get_installed_files, uses, and get_egginfo_files functions? Who will use the returned values, and what for? Where will a user get a value to pass to uses()? Given that get_egginfo_file takes names relative to the egginfo dir, shouldn't get_egginfo_files *return* values relative to the same location, so that they can be used there? I guess the same type of argument applies with get_inetalled_files() and uses() - the values returned by get_installed_files can be seen as a set of opaque tokens (OK, realistically they will be "filenames" in some sense) which can be passed to uses() - but if they aren't absolute filesystem paths, you can't compare them. Example: distribution a contains file p.py distribution b contains file p.py pkgutil.get_distribution(a).get_installed_files contains 'p.py' pkgutil.get_distribution(b).uses('p.py') is True What does this mean? If the 2 distributions are in the same sys.path element, there's a clash. If not, is there a clash? What if they are in 2 different directories? What if they are in the same directory, but it's repeated under different names (e.g. symlinks) on sys.path? What if one is in a directory and the other in a zip file? This isn't even related to non-filesystem loaders. It's a fundamental issue with the whole API. What you seem to want is actually a canonical "name" for a file object - for filesystem files, os.path.normcase(os.path.normpath(filename)) is probably good enough, although it doesn't deal with symlinks. For other types of loader, you have to rely on the loader itself - loader.get_fullname() is probably OK. But even then, it's not clear how user code would actually *use* these APIs. I think you need some real-world use cases, with actual sample (pseudo-)code, to validate the design here. As things stand, it's both confusing and (I suspect) unusable in practice. Sorry, I know that sounds negative, but if this isn't to be a source of subtle bugs for years to come, it needs to be clarified now. PEP 302 is still hitting this type of issue - runpy and importlib have brought out errors and holes in the protocol quite recently - even though Just and I went to great lengths to try to tease out hidden assumptions up front. Concrete proposal: get_metadata_files() - returns slash-separated names, relative to the egginfo dir get_metadata_file(path) - path must be slash-separated, relative to the egginfo dir get_installed_files - returns the contents of RECORD unaltered uses(path) - checks if path is in RECORD The latter 2 are not very useful in practice - you can't say anything about entries in different RECORD files, which is likely the real use case you want. Maybe RECORD could have an extra "Location" entry, which determines where it exists globally (this would be the directory to which the filenames were relative, in the case of filesystem-based distributions) and RECORD entries are comparable if the Location values in the 2 RECORD files match. That's a lot more complex - but depending on what use people expect to make of these 2 APIs, it may be justified. Paul.

2009/7/4 Paul Moore <p.f.moore@gmail.com>:
I just run them from within the directory
If I set PYTHONPATH=. then I get errors. I suspect path normalisation (for backslashes) in the zipfile handling.
Applied, thanks (I didn't run them under win32 yet)
I'll add some more tests then, or even user stories.
Agreed, the zip case was added afterwards, but in practice, the APIs are still dealing with the files are *filesystem files* located in a container (eg a directory or a zip file) located somewhere on the filesystem. "local" in that case is a flag that means "translate a file path expressed in the local filesystem" which make no sense anymore with zip files. But the goal really, is to be able to point out that two distributions are using the very same file. Right now PEP 376 and the prototype code handle these two real world use cases: - browsing regular site-packages-like directories - browsing site-packages-like directories, that are zipped. For example: - I have a "packages.zip" file in /var/, wich is also in my sys.path. It contains a distribution "foo-1.0" that has the "roman.py" file in its root. So the RECORD file located in "foo-1.0.egg-info" has a line starting with "roman.py,..." - Then if I install docutils 0.5 as a regular filesystem distribution, "roman.py" will be added in Python's site-packages. and docutils-0.5.egg-info/RECORD will contain "roman.py,..." with the same hash. The local flag will return these paths: - /var/packages.zip/roman.py <--- not a "real" path - /usr/local/lib/python2.6/site-packages/roman.py So removing the docutils distribution will be doable, because these paths are different.
Yes, In practice, if you look at my previous example, even if "/var/packages.zip/roman.py" isn't a real path, it's enough to compare RECORD entries globally. The "Location" entry you are proposing in that case, would be "/var/packages.zip". But do we really need to store it the RECORD ? Or can't we define an API that returns two elements : - the path to the location (in the example: /var/packages.zip or /usr/local/lib/python2.6/site-packages) - the path within the location itself (in the example: roman.py) A concrete proposal would be to take back your proposal, but return tuples with the location as the first member. e.g. "(location, relative path[s])" The code that is comparing paths to see if they are the same can join location+relative path[s], while we can provide in a dedicated function something to read the content of the file (that would be get_data I guess, if I refer to PEP 302) Tarek

2009/7/5 Tarek Ziadé <ziade.tarek@gmail.com>:
That sounds reasonable. So we can forget the "local" parameter, and return a tuple: - absolute location of the container (directory, zipfile or whatever containing the egginfo file) as a filesystem path in canonical native form (where it's filesystem based) or as an opaque token for the odd cases (frozen modules, for example) where a filesystem location isn't available. - entry from the RECORD file, as a slash-separated filename relative to the root of the container.
Unfortunately, get_data loads data files located within a *package*, using a name relative to the package directory. You can't get at the metadata of a *distribution* like that. But if you're using get_installed_files(), why would you then want to read the files? What exactly would you *use* get_installed_files for which would then leave you needing to read the files? If it's to check they haven't changed (by comparing md5 values) you're doing that to uninstall, so that's the responsibility of the uninstall function. Again, it's a question of what is a public API, and what is the use case it's designed for. I'm currently writing a SQLite importer, which will allow me to store "files" in any sort of database tables I want, so I can build in some nice pathological behaviour. That should tease out some awkward corner cases :-) Paul

On Sun, Jul 5, 2009 at 3:27 PM, Paul Moore<p.f.moore@gmail.com> wrote:
exactly,
Right. These APIs were created for third-party package managers. One use case of a package manager is the uninstallation, but I have no other use case in mind.
Sounds good. Semi-related: even if the files themselves are in the filesystem, having a sqlite db to index the list of installed distributions makes a good cache solution to reduce the disk I/O and speed up the query functions. So maybe we could use a disk-based cache for site-packages-like directories in the form of a sqlite db. That's what I am experimenting on my side.
Paul
-- Tarek Ziadé | http://ziade.org

2009/7/3 Tarek Ziadé <ziade.tarek@gmail.com>:
One important note - I plan on using the fact that DistributionDirMap is not public, and hacking it about drastically, or possibly even removing it. (After all, the likes of the load method don't make sense when you've got sys.meta_path, sys.path_importer_cache and the like to consider). If the loss of any of the "internal" classes is an issue, say so now! Paul.

On Sat, Jul 4, 2009 at 3:04 PM, Paul Moore<p.f.moore@gmail.com> wrote:
Go for it please, and let me know if you set a bitbucket account, so you can push your commits in there directly
Paul.
-- Tarek Ziadé | http://ziade.org

2009/7/5 Tarek Ziadé <ziade.tarek@gmail.com>:
Go for it please, and let me know if you set a bitbucket account, so you can push your commits in there directly
My bitbucket account is 'pmoore'. Paul.
participants (4)
-
Brett Cannon
-
Nick Coghlan
-
Paul Moore
-
Tarek Ziadé