package management - common storage while keeping the versions straight

Hi all,

First post here. I have a cluster where the common software is NFS-shared from the file server to the other nodes. All the python packages are kept in a directory which is referenced by PYTHONPATH. The good part of that is that there is just one copy of each package-version. The bad part, as you have all no doubt guessed, is that python by itself is really bad at specifying and loading a set of particular library versions (see below), so upgrading one program will break another due to conflicting installed versions. Hence the common use of virtualenvs. But as far as I can tell each virtualenv installs a copy of each package-version it needs, resulting in multiple copies of the same package-version for common packages on the same disk.

What I am after is some method of keeping exactly one copy of each package-version in the common area (i.e., one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links from another directory into the common area, and setting PYTHONPATH to that directory for the application. Finally, this has to be usable by any account which has read/execute access to the main directory.

Does such a beast exist? If so, please point me to it!

The limitations of python version handling to which I refer above can be illustrated with "scanpy-scripts"'s dependencies. Given all the needed libraries in one place (plus incompatible versions), the right set can be loaded (and verified) like this:

    export PYTHONPATH=/path/to_common_area
    python3

    __requires__ = ['scipy <1.3.0,>=1.2.0', 'anndata <0.6.20',
                    'loompy <3.0.0,>=2.00', 'h5py <2.10']
    import pkg_resources
    import scipy
    import anndata
    import loompy
    import h5py
    import scanpy
    print(scipy.__version__)
    print(anndata.__version__)
    print(loompy.__version__)
    print(h5py.__version__)
    print(scanpy.__version__)
    quit()

which emits exactly the versions scanpy-scripts needs:

    1.2.3
    0.6.19
    2.0.17
    2.9.0
    1.4.3

However, adding

    , 'scanpy <1.4.4,>=1.4.2'

at the end of __requires__ makes the whole thing fail at "import pkg_resources" with (many lines deleted):

    792, in resolve
        raise VersionConflict(dist, req).with_context(dependent_req)
    pkg_resources.ContextualVersionConflict: (scipy 1.2.3
    (/home/common/lib/python3.6/site-packages/scipy-1.2.3-py3.6-linux-x86_64.egg),
    Requirement.parse('scipy>=1.3.1'), {'umap-learn'})

even though the scanpy it loaded in the first case was within the desired range. Moreover, specifying the desired versions as parameters after "import pkg_resources" does not work at all, since pkg_resources keeps only the highest version of each package it finds when imported. (A limitation that never made the least bit of sense to me.)

The test system is CentOS 8 with python 3.6.8.

Thanks,

David Mathog

On Wed, 24 Jun 2020 at 00:00, David Mathog <dmathog@gmail.com> wrote:
Does such a beast exist? If so, please point me to it!
Basically no, or at least not to my knowledge. The mechanisms exist, in the form of import hooks and similar, to build something like this, but it has not proved to be a common enough requirement for a well-known/standard library to have emerged. I believe that setuptools (pkg_resources) had a mechanism to do something along these lines, but it never really became popular and I don't know if it's still considered supported by the setuptools maintainers. So I think you're going to have to either accept the need for multiple copies, or write something specific for your situation.

Sorry,
Paul
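(For concreteness: even without a full import-hook framework, the selection described above can be approximated by prepending one versioned directory per package to sys.path before anything is imported. A minimal sketch, assuming a hypothetical layout with one subdirectory per package-version, e.g. /common/foo-1.7/foo/__init__.py:)

    import sys
    from pathlib import Path

    COMMON = Path("/common")  # hypothetical shared area

    def pin(requirements):
        """Prepend the directory of each pinned package-version to sys.path.

        requirements: mapping of package name -> exact version,
        e.g. {"foo": "1.7"}.  Must run before the packages are imported.
        """
        for name, version in requirements.items():
            versioned = COMMON / ("%s-%s" % (name, version))
            if not versioned.is_dir():
                raise ImportError("%s %s not found in %s" % (name, version, COMMON))
            sys.path.insert(0, str(versioned))

    pin({"foo": "1.7"})
    import foo  # now resolves to /common/foo-1.7/foo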

On Tue, 23 Jun 2020, at 23:51, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links to the common PYTHONPATH area from another directory for which it sets PYTHONPATH for the application. Finally, this has to be usable by any account which has read execute access to the main directory.
Conda environments work somewhat like this - all the packages are stored in a central place, and the structure of selected ones is replicated using hardlinks in a site-packages directory belonging to the environment. So if your concern is not to waste disk space by storing copies of the same packages, that might be an option. Thomas
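(For illustration only — this is not conda's actual code, and the paths are hypothetical — the hardlink idea looks roughly like this. Note that hardlinks require the central store and the environment to sit on the same filesystem, which matters for an NFS-shared area:)

    import os
    from pathlib import Path

    STORE = Path("/common/pkg-store")   # one central copy of each package-version
    ENV_SP = Path("/wherever/env/lib/python3.6/site-packages")

    def link_from_store(relpath):
        """Replace ENV_SP/relpath with a hardlink to STORE/relpath."""
        src = STORE / relpath
        dst = ENV_SP / relpath
        if dst.exists():
            dst.unlink()
        dst.parent.mkdir(parents=True, exist_ok=True)
        os.link(src, dst)               # hardlink: one inode, two names, no extra space

    link_from_store("numpy/version.py")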

On Wed, Jun 24, 2020 at 1:36 AM Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Tue, 23 Jun 2020, at 23:51, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. Conda environments work somewhat like this - all the packages are stored in a central place, and the structure of selected ones is replicated using hardlinks in a site-packages directory belonging to the environment. So if your concern is not to waste disk space by storing copies of the same packages, that might be an option.
I experimented with that one a little. It installs its own copies of python and things like openssl and openblas which are already present from the linux distribution. Similarly, if some python script needs "bwa" it will install its own even though that program is already available. Basically it is yet another "replicate everything we might need whether or not it is already present" type of solution. (The extreme end of that spectrum are systems like docker, which effectively replaces the entire OS.) So there might be only the one version of each python package (not counting duplicates with the OS's python3) but now there are also duplicate copies of system libraries and utilities. I think I will experiment a little with pipenv and if necessary after each package install use a script to remove the installed libraries and replace them with a hard link to the one in the common area. Maybe it will be possible to put in those links before installing the package of interest (like for scanpy, see first post), which will hopefully keep it from having to rebuild all those packages too. Thanks, David Mathog

On 24Jun2020 1923, David Mathog wrote:
I think I will experiment a little with pipenv and if necessary after each package install use a script to remove the installed libraries and replace them with a hard link to the one in the common area. Maybe it will be possible to put in those links before installing the package of interest (like for scanpy, see first post), which will hopefully keep it from having to rebuild all those packages too.
Here's a recent discussion about this exact idea (with a link to an earlier discussion on this list): https://discuss.python.org/t/proposal-sharing-distrbution-installations-in-g... It's totally possible, though it's always a balance of trade-offs. Some of the people on that post may be interested in developing a tool to automate parts of the process. Cheers, Steve

Thanks for the link. Unfortunately there was not a reference to a completed package that actually did this. As in, I really do not want to reinvent the wheel. Ugh, sorry, that's a pun in this context.

Here is a first shot at this: installing a moderately complicated package in a virtualenv and then reinstalling it in another virtualenv. Extract and execinput are my own programs (from drm_tools on sourceforge) but it is obvious from the context what they are doing. The links had to be soft because linux does not actually allow a normal user (or maybe even root) to make a hard link to a directory.

    cd /usr/common/lib/python3.6/Envs
    rm -rf ~/.cache/pip              # make download clearer
    python3 -m venv scanpy
    source scanpy/bin/activate
    python -m pip install -U pip     # update 9.0.3 to 20.1.1
    which python3                    # using the one in scanpy
    pip3 install scanpy
    scanpy -h                        # seems to start
    deactivate

    rm -rf ~/.cache/pip              # make download clearer
    python3 -m venv scanpy2
    source scanpy2/bin/activate
    python -m pip install -U pip     # update 9.0.3 to 20.1.1
    export DST=/usr/common/lib/python3.6/Envs/scanpy/lib/python3.6/site-packages
    export SRC=/usr/common/lib/python3.6/Envs/scanpy2/lib/python3.6/site-packages
    ls -1 $DST \
      | grep -v __pycache__ \
      | grep -v scanpy \
      | grep -v easy_install.py \
      | extract -fmt "ln -s $DST/[1,] $SRC/[1,]" \
      | execinput
    pip3 install scanpy
    # downloaded scanpy, "Requirement already satisfied" for all the others
    # Installing collected packages: scanpy
    #   Successfully installed scanpy-1.5.1
    scanpy -h                        # seems to start
    deactivate
    source scanpy/bin/activate
    scanpy -h                        # seems to start (still)
    deactivate

So that method seems to have some promise. It saved a considerable amount of space too:

    du -k scanpy | tail -1
    457408  scanpy
    du -k scanpy2 | tail -1
    24900   scanpy2

However, two potential problems are evident on inspection. The first is that when the 2nd scanpy installation was performed it updated the dates on all the directories in $DST. A workaround would be to copy all of those directories into the virtualenv temporarily, just for the installation, and then remove them and put the links in afterwards. That strikes me as awfully kludgy. Setting them read-only would likely break the install.

The second issue is that each package install creates two directories like:

    llvmlite
    llvmlite-0.33.0.dist-info

where the latter contains top_level.txt, which in turn contains one line:

    llvmlite

pointing to the first directory. If another version must cohabit with it, the "llvmlite" directories will conflict. For this sort of approach to work easily the llvmlite directory should be named "llvmlite-0.33.0" and top_level.txt should reference that too. It would probably be possible to work around it, though, by having llvmlite-0.33.0 only in the common area and using:

    ln -s $COMMON/llvmlite-0.33.0 $VENVAREA/llvmlite

The top_level.txt in each could then reference the unversioned name. Unknown if this soft link approach will work on Windows.

Regards,

David Mathog
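(For readers without drm_tools, a sketch of the same link-generation step in plain python; the skip list mirrors the grep chain above, and the paths are the ones from the experiment:)

    import os

    DST = "/usr/common/lib/python3.6/Envs/scanpy/lib/python3.6/site-packages"
    SRC = "/usr/common/lib/python3.6/Envs/scanpy2/lib/python3.6/site-packages"
    SKIP = ("__pycache__", "scanpy", "easy_install.py")  # same substring filters as grep -v

    for entry in sorted(os.listdir(DST)):
        if any(s in entry for s in SKIP):
            continue
        # equivalent of: ln -s $DST/<entry> $SRC/<entry>
        os.symlink(os.path.join(DST, entry), os.path.join(SRC, entry))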

On Tue, 2020-06-23 at 15:51 -0700, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links to the common PYTHONPATH area from another directory for which it sets PYTHONPATH for the application. Finally, this has to be usable by any account which has read execute access to the main directory.
Does such a beast exist? If so, please point me to it!
I have been meaning to do something like this for a while now! But unfortunately I can't find the time. If you do choose to start implementing it, please let me know. I would be happy to help out.

Cheers,
Filipe Laíns

It turned out that the second install was not the cause of the timestamp change in the original. On reviewing "history" it turned out that I had accidentally run the link generation twice. That turned up this (for me) unexpected behavior:

    mkdir /tmp/foo
    ls -al /tmp/foo
    total 16
    drwxrwxr-x.   2 modules modules     6 Jun 24 16:49 .
    drwxrwxrwt. 173 root    root    12288 Jun 24 16:49 ..
    ln -s /tmp/foo /tmp/bar
    ls -al /tmp/foo
    drwxrwxr-x.   2 modules modules     6 Jun 24 16:49 .
    drwxrwxrwt. 173 root    root    12288 Jun 24 16:49 ..
    ln -s /tmp/foo /tmp/bar
    ls -al /tmp/foo
    total 16
    drwxrwxr-x.   2 modules modules    17 Jun 24 16:51 .
    drwxrwxrwt. 173 root    root    12288 Jun 24 16:50 ..
    lrwxrwxrwx.   1 modules modules     8 Jun 24 16:51 foo -> /tmp/foo

The repeated soft link actually put a link under the target. Strange, but apparently it is expected behavior. The problem can be avoided by using this form:

    ln -sn $TARGET $LINK

The later installs are much faster than the first one, since putting in the links is very fast and building the packages is not. This was the trivial case though, since having done one install all the prerequisites were just "there". The johnnydep package will list the dependencies without doing the install. Guess I will throw something together based on that and the above results and see how it goes.

Regards,

David Mathog
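(For what it's worth, a link script written in python sidesteps the pitfall, since os.symlink refuses to overwrite an existing path — it raises FileExistsError instead of silently creating a link inside the target directory. A sketch of an idempotent link step:)

    import os

    def ensure_symlink(target, link):
        """Create link -> target; replace it if it already exists as a symlink."""
        try:
            os.symlink(target, link)
        except FileExistsError:
            if os.path.islink(link) and os.readlink(link) == target:
                return                  # already correct, nothing to do
            os.unlink(link)             # equivalent of ln -sfn for an existing symlink
            os.symlink(target, link)

    ensure_symlink("/tmp/foo", "/tmp/bar")
    ensure_symlink("/tmp/foo", "/tmp/bar")  # safe to repeat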

On Thu, 25 Jun 2020 at 00:06, David Mathog <dmathog@gmail.com> wrote:
Thanks for the link. Unfortunately there was not a reference to a completed package that actually did this. As in, I really do not want to reinvent the wheel. Ugh, sorry, that's a pun in this context.
I think the key message here is that you won't be *re*-inventing the wheel. This is a wheel that still needs to be invented. Paul (It was *way* too hard trying to write the above without tripping over the extended "wheel" pun ;-))

On Thu, Jun 25, 2020 at 12:37 AM Paul Moore <p.f.moore@gmail.com> wrote:
I think the key message here is that you won't be *re*-inventing the wheel. This is a wheel that still needs to be invented.
It _was_ invented, but it is off round and gives a rough ride. As noted in the first post, this:

    __requires__ = ['scipy <1.3.0,>=1.2.0', 'anndata <0.6.20',
                    'loompy <3.0.0,>=2.00', 'h5py <2.10']
    import pkg_resources

was able to load the desired set of package-versions for scanpy, but setting a version number constraint on scanpy itself at the end of that list, one which matched the version that the preceding commands successfully loaded, broke it. So it is not reliable.

And the entire __requires__ kludge is only present because, for reasons beyond my pay grade, this:

    import pkg_resources
    pkg_resources.require("scipy<1.3.0,>=1.2.0;anndata<0.6.20;etc.")
    import scipy
    import anndata
    #etc.

cannot work, because by default "import pkg_resources" keeps only the most recent version rather than making up a tree (or list or hash or whatever) and waiting to see if there are any version constraints to be applied at the time of actual package import.

What I'm doing now is basically duct tape and baling wire to work around those deeper issues. In terms of language design, a much better fix would be to modify pkg_resources so that it will always successfully load the required versions from a designated directory which contains multiple versions of packages, and modify the package maintenance tools so that they can maintain such a directory.

Regards,

David Mathog
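(An aside, hedged since the multi-version egg scheme is long deprecated: pkg_resources.require() takes each requirement as a separate argument — or newline-separated — rather than a ";"-joined string, and commas separate version specs for a single project. With unactivated multi-version eggs on sys.path it can activate an older version; resolution still walks the full dependency graph, though, so the umap-learn conflict above fails the same way. A sketch, assuming scipy-1.2.3 and scipy-1.3.1 eggs are both present in the common area:)

    import pkg_resources

    # Each argument is one requirement; commas join version specs for one project.
    # If resolution succeeds, the matching egg directories are added to sys.path.
    pkg_resources.require("scipy>=1.2.0,<1.3.0", "anndata<0.6.20")

    import scipy
    print(scipy.__version__)   # 1.2.3, if resolution succeeded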

Questions about naming conventions.

The vast majority of packages, when they install, create in site-packages two directories with names like:

    foobar
    foobar-1.2.3.dist-info   (or egg-info)

However PyYAML creates:

    yaml
    PyYAML-5.3.1-py3.6.egg-info

and there is also this:

    pkg_resources

which is not associated with a versioned package. In python3:

    import yaml
    import pkg_resources
    print(yaml.__version__)
    5.3.1
    print(pkg_resources.__version__)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: module 'pkg_resources' has no attribute '__version__'

So by what method could code working outside of python possibly determine that "yaml" goes with "PyYAML"? Is this a common situation? Is pkg_resources actually a package? Does it make sense for a common package repository to have a single instance of this directory, or should each installed python-based program retain its own version of it?

There are some other files that live in site-packages which are not actually packages. The list so far is:

    __pycache__

    # some dynamic libraries, like
    kiwisolver.cpython-36m-x86_64-linux-gnu.so

    # some pth files, but always so far with an explicit version number, like
    sphinxcontrib_applehelp-1.0.2-py3.8-nspkg.pth
    # or associated with a package with a version number, like:
    setuptools
    setuptools-46.1.3.dist-info
    setuptools.pth

    # some py files, apparently when that package does not make a
    # corresponding directory, like:
    zipp-3.1.0.dist-info
    zipp.py

    # initialization file "site" as
    site.py
    site.pyc

Any others to look out for? That is, files which might be installed in site-packages but which should not be shared.

Hopefully this next is an appropriate question for this list, since the issue arises from how python loads packages. Is there any way to avoid collisions between python-based programs other than activating and deactivating their virtualenvs, or redefining PYTHONPATH, before each is used? For programs whose library loading is determinate (usually the case with C, fortran, bash scripts, etc.) one can construct a bash script (for instance) which runs 3 programs in order like so:

    prog1
    prog2
    prog3   # spawns subprocesses which run prog2 and prog1

and there are not generally any issues. (Yes, one can create a mess with LD_PRELOAD and the like.) But if those 3 are python programs, then unless prog1, prog2, and prog3 are all built into the same virtualenv, which usually means they come from the same software distribution, I don't see how to avoid conflicts for the first two cases without activating/deactivating each one, which looks like it might be tricky in the 3rd case.

If one has a directory like:

    TOP/bin/prog
    TOP/lib/python3.6/site-packages

other than using PYTHONPATH to direct to it with an absolute path, is there any way to force prog to only import from that specific site-packages? Let me try that again. Is there a way to tell prog, via any environmental variable, to look in "../lib/python3.6/site-packages" (and nowhere else) for imports, with the reference directory being the one where prog is installed, not wherever the process PWD might happen to be? Because if that was possible it might allow a sort of "set it and forget it" method like:

    export PYTHONRELPATHFROMPROG="../lib/python3.6/site-packages"
    prog1   # uses prog1 site-packages
    prog2   # uses prog2 site-packages
    prog3   # uses prog3 site-packages
    #   prog1 subprocess   # uses prog1 site-packages
    #   prog2 subprocess   # uses prog2 site-packages

(None of which would be necessary if python programs could import specific versions reliably from a common directory containing multiple versions of each package.)

Thanks,

David Mathog
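(No such environment variable exists, but a program can do the equivalent for itself at startup by resolving its own location. A minimal sketch, assuming the TOP/bin + TOP/lib layout above; to get "and nowhere else" one would have to rebuild sys.path entirely rather than just prepend:)

    import os
    import sys

    # Directory containing the running script (TOP/bin), regardless of PWD.
    here = os.path.dirname(os.path.realpath(sys.argv[0]))
    private = os.path.normpath(os.path.join(
        here, "..", "lib",
        "python%d.%d" % sys.version_info[:2], "site-packages"))

    # Prepend, so this program's packages win over anything else on sys.path.
    sys.path.insert(0, private)

    import scanpy  # now resolved from TOP/lib/.../site-packages first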

On Fri, Jun 26, 2020 at 12:43 PM David Mathog <dmathog@gmail.com> wrote:
So by what method could code working outside of python possibly determine that "yaml" goes with "PyYAML"?
Sorry, I forgot that the information was in

    PyYAML-5.3.1-py3.6.egg-info/top_level.txt

Still, how common is that? Can anybody offer an estimate of what fraction of packages use different names like that?

Thanks,

David Mathog

(Sending to the list this time.) On 2020 Jun 26, at 15:43, David Mathog <dmathog@gmail.com> wrote:
So by what method could code working outside of python possibly determine that "yaml" goes with "PyYAML"?
By checking all *.dist-info/RECORD files to see which one mentions the "yaml" directory. (top_level.txt could also be checked, but I believe that only setuptools creates this file — projects built with flit or poetry don't have it — and it's not very helpful when namespace packages are involved.)
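(A sketch of that RECORD scan — the site-packages path is hypothetical, and RECORD is a CSV of path,hash,size rows relative to site-packages. On much newer pythons, importlib.metadata.packages_distributions() gives the same mapping directly:)

    import csv
    from pathlib import Path

    def module_to_project(site_packages):
        """Map each top-level module/package name to the project that installed it."""
        mapping = {}
        for record in Path(site_packages).glob("*.dist-info/RECORD"):
            # "PyYAML-5.3.1.dist-info" -> project name "PyYAML"
            project = record.parent.name[:-len(".dist-info")].rsplit("-", 1)[0]
            with record.open(newline="") as f:
                for row in csv.reader(f):
                    top = row[0].split("/")[0]
                    if top.endswith(".py"):
                        top = top[:-3]              # single-module dists like zipp.py
                    if top in ("..", "__pycache__") or top.endswith(".dist-info"):
                        continue                    # scripts, caches, metadata
                    mapping[top] = project
        return mapping

    print(module_to_project("/usr/lib/python3.6/site-packages").get("yaml"))  # "PyYAML"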
Is this a common situation?
It happens whenever the project "foo" distributes a module named something other than "foo". Other projects like this that I can think of off the top of my head are BeautifulSoup4 (module: bs4), python-dateutil (module: dateutil), and attrs (module: attr).
Is pkg_resources actually a package?
pkg_resources is a module distributed by the setuptools project (alongside the modules "setuptools" and "easy_install").
Does it make sense for a common package repository to have a single instance of this directory or should each installed python based program retain its own version of this?
There should be one instance per each version of setuptools stored in the repository. -- John Wodder

On 2020 Jun 26, at 15:50, David Mathog <dmathog@gmail.com> wrote:
Still, how common is that? Can anybody offer an estimate about what fraction of packages use different names like that?
Scanning through the wheelodex.org database (specifically, a dump from earlier this week) finds 32,517 projects where the wheel DOES NOT contain a top-level module of the same name as the project (after correcting for differences in case and hyphen vs. underscore vs. period) and 74,073 projects where the wheel DOES contain a module of the same name. (5,417 projects containing no modules were excluded.) Note that a project named "foo-bar" containing a namespace package "foo/bar" is counted in the former group. Of the 32,517 non-matching projects, 7,117 were Odoo projects with project names of the form "odoo{version}_addon_{foo}" containing namespace modules of the form "odoo/addons/{foo}", and 3,175 were Django projects with project names of the form "django_{foo}" containing packages named just "{foo}". No other major patterns seem to stand out. -- John Wodder

Thanks for that feedback. Looks like RECORD is the one to use.

The names of the directories ending in dist-info seem to be uniformly:

    package-version.dist-info

but the directory names associated with eggs come in a lot of flavors:

    anndata-0.6.19-py3.6.egg
    cutadapt-2.10.dev20+g93fb340-py3.6-linux-x86_64.egg
    scanpy-1.5.2.dev7+ge33a2f33-py3.6.egg
    h5py-2.9.0-py3.6-linux-x86_64.egg
    simplejson-3.17.0-py3.6.egg-info

johnnydep does not give any hints that this is coming:

    johnnydep --output-format pinned h5py
    # relevant part:
    h5py==2.10.0

What would be some small examples for other package managers? I would like to see what they have as equivalents to dist-info and egg-info so that the script does not choke on them.

Some progress with the test script. It can now convert a virtualenv to a regular directory and migrate the site-packages contents to a shared area. A second migration of a copy of the same virtualenv to a different regular directory correctly makes links to the first set. (That is, two normal directories both linked to one common set of packages.) And the test program (johnnydep) runs in both with PYTHONPATH set correctly.

But preinstalling, that is, setting links to the common directory before doing a normal install, is tricky because of the name inconsistencies. To do that it must run johnnydep to get the necessary information, and that is not very fast. A normal install of johnnydep itself, complete with downloads, takes less time than that program's own analysis:

    time johnnydep johnnydep            # 21s

vs.

    rm -rf ~/.cache/pip                 # force actual downloads; too fast to measure
    time python3 -m venv johnnydep      # 2.3s
    source johnnydep/bin/activate       # too fast to measure
    time python -m pip install -U pip   # update 9.0.3 to 20.1.1; 3.4s
    time pip3 install johnnydep         # 7.8s

Probably a package with a huge amount of compilation would be a win for a preinstall, but it is at this point definitely not an "always faster" option.

Thanks,

David Mathog
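(The core of such a migrate step might look like the following sketch — not the actual test script, and the shared-area path is hypothetical: move the first copy of each name-version into the shared area, drop later duplicates, and leave symlinks behind:)

    import shutil
    from pathlib import Path

    COMMON = Path("/usr/common/pyshared")   # hypothetical shared area

    def migrate(site_packages):
        """Move each package out to COMMON/<name>-<version>/ and symlink back."""
        sp = Path(site_packages)
        for dist_info in sp.glob("*.dist-info"):
            name_version = dist_info.name[:-len(".dist-info")]  # e.g. "llvmlite-0.33.0"
            tops_file = dist_info / "top_level.txt"
            tops = tops_file.read_text().split() if tops_file.exists() else []
            for entry in [dist_info.name] + tops:
                src = sp / entry
                if not src.exists():
                    src = sp / (entry + ".py")   # single-module dists like zipp.py
                if not src.exists() or src.is_symlink():
                    continue                     # nothing local left to move
                dst = COMMON / name_version / src.name
                if dst.exists():                 # already shared: drop the duplicate
                    if src.is_dir():
                        shutil.rmtree(str(src))
                    else:
                        src.unlink()
                else:                            # first copy becomes the shared one
                    dst.parent.mkdir(parents=True, exist_ok=True)
                    shutil.move(str(src), str(dst))
                src.symlink_to(dst)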

On Sat, 27 Jun 2020 at 01:37, David Mathog <dmathog@gmail.com> wrote:
Thanks for that feedback. Looks like RECORD is the one to use.
The names of the directories ending in dist-info seem to be uniformly:
package-version.dist-info
Note that if you're doing something like this, you should probably read PEP 376 (https://www.python.org/dev/peps/pep-0376/) which defines the standard layout of installed packages.
but the directory names associated with eggs come in a lot of flavors:
anndata-0.6.19-py3.6.egg
cutadapt-2.10.dev20+g93fb340-py3.6-linux-x86_64.egg
scanpy-1.5.2.dev7+ge33a2f33-py3.6.egg
h5py-2.9.0-py3.6-linux-x86_64.egg
simplejson-3.17.0-py3.6.egg-info
The egg format is an older format that was never standardised, so details of that format are likely somewhere in the setuptools documentation. .egg-info directories are the older equivalent of dist-info directories, but egg directories are a very different format (they contain the full distribution plus metadata in one directory). You'd have to find the setuptools documentation of the egg format for that. (Note that the egg format is obsolete, so you may need to look at older documentation - I don't know if the current setuptools docs describe the format.)

I'm not aware of what other formats tools like conda use, sorry.

Paul

On Fri, Jun 26, 2020 at 2:51 PM John Thorvald Wodder II <jwodder@gmail.com> wrote:
Of the 32,517 non-matching projects, 7,117 were Odoo projects with project names of the form "odoo{version}_addon_{foo}" containing namespace modules of the form "odoo/addons/{foo}", and 3,175 were Django projects with project names of the form "django_{foo}" containing packages named just "{foo}". No other major patterns seem to stand out.
In CentOS 8 the RPM

    python3-rhnlib-2.8.6-8.module_el8.1.0+211+ad6c0bc7.noarch

has loaded into the directory /usr/lib/python3.6/site-packages two entries:

    rhn                           # a directory
    rhnlib-2.8.6-py3.6.egg-info   # a file

The latter contains just this text:

    Metadata-Version: 1.0
    Name: rhnlib
    Version: 2.8.6
    Summary: Python libraries for the Spacewalk project
    Home-page: http://rhn.redhat.com
    Author: Mihai Ibanescu
    Author-email: misa@redhat.com
    License: GPL
    Description: rhnlib is a collection of python modules used by the
        Spacewalk (http://spacewalk.redhat.com) software.
    Platform: UNKNOWN

Nor is there a link in the other direction:

    grep -iR rhnlib /usr/lib/python3.6/site-packages/rhn   # nothing

So while "rhn" bears a similarity to "rhnlib", it is neither the package name nor is it listed in the egg-info. This was of course installed by dnf (AKA yum) and not by any egg tooling. Is it possible for any python installer (as opposed to dnf, which runs outside of python) to install an unreferenced directory like this? Presumably not with a dist-info, but with an egg-info that does not in any way reference the active part of the installation?

In a small collection (172 packages) here, these were the only two "file" egg-info entries found, with their associated directories:

    busco
    BUSCO-4.0.6-py3.6.egg-info

    ngs
    ngs-1.0-py3.6.egg-info

In neither case does the egg-info file reference the corresponding directory, but at least the directory in both has the expected package name (other than case). In the examples you cited at the top, were any of those "different name" cases from packages with a "file" egg-info?

Thanks,

David Mathog

On 2020 Jun 29, at 16:09, David Mathog <dmathog@gmail.com> wrote:
In neither case does the egg-info file reference the corresponding directory, but at least the directory in both has the expected package name (other than case). In the examples you cited at the top, were any of those "different name" cases from packages with a "file" egg-info?
The projects I examined were all in wheel form and thus had *.dist-info directories instead of *.egg-info. I know very little about how eggs work, other than that they're deprecated and should be avoided in favor of wheels. -- John Wodder

Hi all. "Python devirtualizer" is a preliminary implementation which manages shared packages so that only one copy of each package version is required. It installs into a virtualenv, then migrates the contents out into the normal OS environment, and while so doing, replaces what would be duplicate files with soft links to a single copy. It is downloadable from here: https://sourceforge.net/projects/python-devirtualizer/ It is linux (or other POSIX like system, __maybe__ Mac) specific. No way it will run on Windows at this point because the main script is bash and the paths assume POSIX path syntax. (Might work in Mingw64 though.) Anyway, pdvctrl install packageA pdvctrl migrate packageA /wherever/packageA pdvctrl install packageB pdvctrl migrate packageB /wherever/packageB will result in a single copy of the shared dependencies on this system, with both packageA and packageB hooked to them with soft links. The import does not go awry because from within each package's site-packages directory there are only links to the files it needs, so it never sees any conflicting package versions. There is also: pdvctrl preinstall packageC pdvctrl install packageC pdvctrl migrate packageC /wherever/packageC which first uses johnnydep to look up dependencies already on the system and links those in directly before going on to install any pieces not so installed. Unfortunately the johnnydep runs with "preinstall" have so far been significantly slower than just doing a normal install and letting the migrate throw out the extra copy. On the other hand, the one package I have encountered which has conflicting requirements (scanpy-scripts) fails in a more comprehensible manner with "preinstall" than with "install". Migrate "wraps" the files in the package's "bin" directory, if any, so that they may be invoked solely by PATH like a regular program. This uses libSDL2 to get the absolute path of the wrapper program, and it defines PYTHONPATH before execve() to the actual target. So no messing about with PYTHONPATH in the user's shell or in scripts. So far I have not run into a problem with the wrappers, which essentially just inject a PYTHONPATH into the environment when the program is run. Well, one package (busco) had a file with no terminal EOL, which resulted in its last line being dropped while it was being wrapped, but that case is now handled. I do expect though at some point to encounter a package which has several files in its bin, and first_program will contain some variant of: python3 /wherever/bin/second_program The wrapper will break those, since the wrapper is a regular binary and not a python script. Regards, David Mathog On Mon, Jun 29, 2020 at 1:43 PM John Thorvald Wodder II <jwodder@gmail.com> wrote:
participants (6)
- David Mathog
- Filipe Laíns
- John Thorvald Wodder II
- Paul Moore
- Steve Dower
- Thomas Kluyver