package management - common storage while keeping the versions straight
Hi all,

First post here. I have a cluster where the common software is NFS-shared from the file server to the other nodes. All the python packages are kept in a directory which is referenced by PYTHONPATH. The good part of that is that there is just one copy of each package-version. The bad part, as you have all no doubt guessed, is that python by itself is really bad at specifying and loading a set of particular library versions (see below), so upgrading one program will break another due to conflicting installed versions. Hence the common use of virtualenvs. But as far as I can tell each virtualenv installs a copy of each package-version it needs, resulting in multiple copies of the same package-version for common packages on the same disk.

What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links to the common PYTHONPATH area from another directory for which it sets PYTHONPATH for the application. Finally, this has to be usable by any account which has read/execute access to the main directory.

Does such a beast exist? If so, please point me to it!

The limitations of python version handling to which I refer above can be illustrated with "scanpy-scripts"'s dependencies.
Given all the needed libraries in one place (plus incompatible versions) the right set can be loaded (and verified) like this:

export PYTHONPATH=/path/to_common_area
python3

__requires__ = ['scipy <1.3.0,>=1.2.0', 'anndata <0.6.20', 'loompy <3.0.0,>=2.00', 'h5py <2.10']
import pkg_resources
import scipy
import anndata
import loompy
import h5py
import scanpy
print(scipy.__version__)
print(anndata.__version__)
print(loompy.__version__)
print(h5py.__version__)
print(scanpy.__version__)
quit()

which emits exactly the versions scanpy-scripts needs:

1.2.3
0.6.19
2.0.17
2.9.0
1.4.3

However, adding

, 'scanpy <1.4.4,>=1.4.2'

at the end of __requires__ makes the whole thing fail at "import pkg_resources" with (many lines deleted):

792, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (scipy 1.2.3 (/home/common/lib/python3.6/site-packages/scipy-1.2.3-py3.6-linux-x86_64.egg), Requirement.parse('scipy>=1.3.1'), {'umap-learn'})

even though the scanpy it loaded in the first case was within the desired range. Moreover, specifying the desired versions as parameters to "import pkg_resources" does not work at all, since pkg_resources keeps only the highest version of each package it finds when imported. (A limitation that never made the least bit of sense to me.) The test system is CentOS 8 with python 3.6.8.

Thanks,
David Mathog
On Wed, 24 Jun 2020 at 00:00, David Mathog
Does such a beast exist? If so, please point me to it!
Basically no, or at least not to my knowledge. The mechanisms exist, in the form of import hooks and similar, to build something like this, but it hasn't proved to be a common enough requirement for a well-known/standard library to emerge. I believe that setuptools (pkg_resources) had a mechanism to do something along these lines, but it never really became popular and I don't know whether the setuptools maintainers still consider it supported. So I think you're going to have to either accept the need for multiple copies, or write something specific for your situation.

Sorry,
Paul
On Tue, 23 Jun 2020, at 23:51, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links to the common PYTHONPATH area from another directory for which it sets PYTHONPATH for the application. Finally, this has to be usable by any account which has read execute access to the main directory.
Conda environments work somewhat like this - all the packages are stored in a central place, and the structure of selected ones is replicated using hardlinks in a site-packages directory belonging to the environment. So if your concern is not to waste disk space by storing copies of the same packages, that might be an option. Thomas
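The hard-link sharing described above can be illustrated in miniature (a hedged sketch using a temporary directory, not conda's actual layout):

```python
# Two directory entries, one inode: the file's contents are stored once.
# This is the mechanism conda uses to share package files between
# environments without duplicating them on disk.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    central = os.path.join(tmp, "central_store.py")
    env_copy = os.path.join(tmp, "env_site_packages.py")
    with open(central, "w") as fh:
        fh.write("VERSION = '1.7'\n")
    os.link(central, env_copy)          # hard link, not a copy
    print(os.stat(central).st_ino == os.stat(env_copy).st_ino)  # True
    print(os.stat(central).st_nlink)    # 2: two names for one file
```

One consequence, relevant to the thread: editing the file through either name changes "both", so a shared store has to treat package files as immutable.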
On Wed, Jun 24, 2020 at 1:36 AM Thomas Kluyver
On Tue, 23 Jun 2020, at 23:51, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. Conda environments work somewhat like this - all the packages are stored in a central place, and the structure of selected ones is replicated using hardlinks in a site-packages directory belonging to the environment. So if your concern is not to waste disk space by storing copies of the same packages, that might be an option.
I experimented with that one a little. It installs its own copies of python and of things like openssl and openblas which are already present from the linux distribution. Similarly, if some python script needs "bwa" it will install its own even though that program is already available. Basically it is yet another "replicate everything we might need whether or not it is already present" type of solution. (The extreme end of that spectrum are systems like docker, which effectively replaces the entire OS.) So there might be only the one version of each python package (not counting duplicates with the OS's python3) but now there are also duplicate copies of system libraries and utilities.

I think I will experiment a little with pipenv and if necessary after each package install use a script to remove the installed libraries and replace them with hard links to the ones in the common area. Maybe it will be possible to put in those links before installing the package of interest (like for scanpy, see first post), which will hopefully keep it from having to rebuild all those packages too.

Thanks, David Mathog
On 24Jun2020 1923, David Mathog wrote:
I think I will experiment a little with pipenv and if necessary after each package install use a script to remove the installed libraries and replace them with a hard link to the one in the common area. Maybe it will be possible to put in those links before installing the package of interest (like for scanpy, see first post), which will hopefully keep it from having to rebuild all those packages too.
Here's a recent discussion about this exact idea (with a link to an earlier discussion on this list): https://discuss.python.org/t/proposal-sharing-distrbution-installations-in-g... It's totally possible, though it's always a balance of trade-offs. Some of the people on that post may be interested in developing a tool to automate parts of the process. Cheers, Steve
Thanks for the link. Unfortunately there was not a reference to a
completed package that actually did this. As in, I really do not want
to reinvent the wheel. Ugh, sorry, that's a pun in this context.
Here is a first shot at this, just installing a moderately complicated
package in a virtualenv and then reinstalling it in another
virtualenv. Extract and execinput are my own programs (from drm_tools
on sourceforge) but it is obvious from the context what they are
doing. The links had to be soft because linux does not actually allow
a normal user (or maybe even root) to make a hard link to a directory.
cd /usr/common/lib/python3.6/Envs
rm -rf ~/.cache/pip #make download clearer
python3 -m venv scanpy
source scanpy/bin/activate
python -m pip install -U pip #update 9.0.3 to 20.1.1
which python3 #using the one in scanpy
pip3 install scanpy
scanpy -h #seems to start
deactivate
rm -rf ~/.cache/pip #make download clearer
python3 -m venv scanpy2
source scanpy2/bin/activate
python -m pip install -U pip #update 9.0.3 to 20.1.1
export DST=/usr/common/lib/python3.6/Envs/scanpy/lib/python3.6/site-packages
export SRC=/usr/common/lib/python3.6/Envs/scanpy2/lib/python3.6/site-packages
ls -1 $DST \
| grep -v __pycache__ \
| grep -v scanpy \
| grep -v easy_install.py \
| extract -fmt "ln -s $DST/[1,] $SRC/[1,]" \
| execinput
pip3 install scanpy
#downloaded scanpy, "Requirement already satisfied" for all the others
#Installing collected packages: scanpy
# Successfully installed scanpy-1.5.1
scanpy -h #seems to start
deactivate
source scanpy/bin/activate
scanpy -h #seems to start (still)
deactivate
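The extract/execinput pipeline above is specific to drm_tools; the same link farm can be built with a short portable sketch (link_site_packages is an illustrative name; $DST is the populated site-packages, $SRC the new one, and scanpy itself is skipped so it can be installed normally):

```python
# Symlink every entry of an existing site-packages into a new one,
# skipping entries that should not be shared.
import os

SKIP_EXACT = {"__pycache__", "easy_install.py"}

def link_site_packages(dst_sp, src_sp, skip_prefix="scanpy"):
    """dst_sp: populated site-packages (link targets);
    src_sp: new environment's site-packages (links created here)."""
    for entry in sorted(os.listdir(dst_sp)):
        if entry in SKIP_EXACT or entry.startswith(skip_prefix):
            continue
        link = os.path.join(src_sp, entry)
        if not os.path.lexists(link):   # don't re-link on a second run
            os.symlink(os.path.join(dst_sp, entry), link)
```

The lexists() guard makes a second run a no-op, which also sidesteps the repeated-ln surprise discussed further down the thread.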
So that method seems to have some promise. It saved a considerable
amount of space too:
du -k scanpy | tail -1
457408 scanpy
du -k scanpy2 | tail -1
24900 scanpy2
However, two potential problems are evident on inspection.
The first is that when the 2nd scanpy installation was performed it
updated the dates on all the directories in $DST. A workaround would
be to copy all of those directories into the virtualenv temporarily,
just for the installation, and then remove them and put the links in
afterwards. That strikes me as awfully kludgy. Setting them read
only would likely break the install.
The second issue is that each package install creates two directories like:
llvmlite
llvmlite-0.33.0.dist-info
where the latter contains top_level.txt which in turn contains one line:
llvmlite
pointing to the first directory.
If another version must cohabit with it the "llvmlite" directories
will conflict. For this sort of approach to work easily the llvmlite
directory should be named "llvmlite-0.33.0" and top_level.txt should
reference that too. It would be possible (probably) to work around it
though by having llvmlite-0.33.0 only in the common area and use:
ln -s $COMMON/llvmlite-0.33.0 $VENVAREA/llvmlite
The top_level.txt in each could then reference the unversioned name.
Unknown if this soft link approach will work on Windows.
Regards,
David Mathog
On Tue, 2020-06-23 at 15:51 -0700, David Mathog wrote:
What I am after is some method of keeping exactly one copy of each package-version in the common area (ie, one might find foo-1.2, foo-1.7, and foo-2.3 there), while also presenting only the one version of each (let's say foo-1.7) to a particular installed program. On linux it might do that by making soft links to the common PYTHONPATH area from another directory for which it sets PYTHONPATH for the application. Finally, this has to be usable by any account which has read execute access to the main directory.
Does such a beast exist? If so, please point me to it!
I have been meaning to do something like this for a while now! But unfortunately I can't find the time. If you do choose to start implementing it, please let me know. I would be happy to help out. Cheers, Filipe Laíns
It turned out that the second install was not the cause of the
timestamp change in the original. On reviewing "history" it turned
out that I had accidentally run the link generation twice. That
turned up this (for me) unexpected behavior:
mkdir /tmp/foo
ls -al /tmp/foo
total 16
drwxrwxr-x. 2 modules modules 6 Jun 24 16:49 .
drwxrwxrwt. 173 root root 12288 Jun 24 16:49 ..
ln -s /tmp/foo /tmp/bar
ls -al /tmp/foo
drwxrwxr-x. 2 modules modules 6 Jun 24 16:49 .
drwxrwxrwt. 173 root root 12288 Jun 24 16:49 ..
ln -s /tmp/foo /tmp/bar
ls -al /tmp/foo
total 16
drwxrwxr-x. 2 modules modules 17 Jun 24 16:51 .
drwxrwxrwt. 173 root root 12288 Jun 24 16:50 ..
lrwxrwxrwx. 1 modules modules 8 Jun 24 16:51 foo -> /tmp/foo
The repeated soft link actually put a file under the target. Strange.
Apparently it is expected behavior: without -n, ln dereferences an existing symlink-to-directory and creates the new link inside the target. The problem can be avoided by using this form:
ln -sn $TARGET $LINK
(with -n an existing $LINK is treated as a normal file, so a repeat invocation fails instead of nesting; add -f to replace the link).
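The nesting behavior and the -n fix can be reproduced in a few lines (a sketch assuming GNU coreutils ln):

```shell
# Repeating "ln -s dir link" nests a link inside the target, because ln
# dereferences the existing link; -n treats it as a plain file and refuses.
tmp=$(mktemp -d)
mkdir "$tmp/foo"
ln -s "$tmp/foo" "$tmp/bar"               # bar -> foo, as intended
ln -s "$tmp/foo" "$tmp/bar"               # dereferences bar: creates foo/foo!
[ -L "$tmp/foo/foo" ] && echo "nested link created"
rm "$tmp/foo/foo"
ln -sn "$tmp/foo" "$tmp/bar" 2>/dev/null || echo "ln -sn: refused, no nesting"
rm -rf "$tmp"
```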
The later installs are much faster than the first one, since putting
in the links is very fast and building the packages is not. This was
the trivial case though, since having done one install all the
prerequisites were just "there". The johnnydep package will list the
dependencies without doing the install. Guess I will throw something
together based on that and the above results and see how it goes.
Regards,
David Mathog
On Thu, 25 Jun 2020 at 00:06, David Mathog
Thanks for the link. Unfortunately there was not a reference to a completed package that actually did this. As in, I really do not want to reinvent the wheel. Ugh, sorry, that's a pun in this context.
I think the key message here is that you won't be *re*-inventing the wheel. This is a wheel that still needs to be invented. Paul (It was *way* too hard trying to write the above without tripping over the extended "wheel" pun ;-))
On Thu, Jun 25, 2020 at 12:37 AM Paul Moore
I think the key message here is that you won't be *re*-inventing the wheel. This is a wheel that still needs to be invented.
It _was_ invented, but it is off round and gives a rough ride. As noted in the first post, this:

__requires__ = ['scipy <1.3.0,>=1.2.0', 'anndata <0.6.20', 'loompy <3.0.0,>=2.00', 'h5py <2.10']
import pkg_resources

was able to load the desired set of package-versions for scanpy, but setting a version number constraint on scanpy itself at the end of that list, one which matched the version that the preceding commands successfully loaded, broke it. So it is not reliable.

And the entire __requires__ kludge is only present because, for reasons beyond my pay grade, this:

import pkg_resources
pkg_resources.require("scipy<1.3.0,>=1.2.0;anndata<0.6.20;etc.")
import scipy
import anndata
#etc.

cannot work, because by default "import pkg_resources" keeps only the most recent version rather than building a tree (or list or hash or whatever) and waiting to see whether there are any version constraints to be applied at the time of actual package import.

What I'm doing now is basically duct tape and baling wire to work around those deeper issues. In terms of language design, a much better fix would be to modify pkg_resources so that it will always successfully load the required versions from a designated directory which contains multiple versions of packages, and to modify the package maintenance tools so that they can maintain such a directory.

Regards,
David Mathog
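For reference, the constraint matching itself can be exercised in isolation with pkg_resources (a sketch that only checks version strings against a specifier, without importing any of the packages involved):

```python
# Check candidate version strings against a requirement specifier,
# independent of what is actually installed.
import pkg_resources

req = pkg_resources.Requirement.parse("scipy<1.3.0,>=1.2.0")
print("1.2.3" in req)   # True: inside the requested range
print("1.3.1" in req)   # False: the version umap-learn demanded anyway
```

This is the same Requirement machinery that produced the ContextualVersionConflict in the first post; the conflict arises at resolution time, not from the matching itself.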
Questions about naming conventions. The vast majority of packages when they install create in site-packages two directories with names like:

foobar
foobar-1.2.3.dist-info (or egg-info)

However PyYAML creates:

yaml
PyYAML-5.3.1-py3.6.egg-info

and there is also this:

pkg_resources

which is not associated with a versioned package. In python3:

import yaml
import pkg_resources
print(yaml.__version__)
5.3.1
print(pkg_resources.__version__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'pkg_resources' has no attribute '__version__'
So by what method could code working outside of python possibly determine that
"yaml" goes with "PyYAML"? Is this a common situation?
Is pkg_resources actually a package? Does it make sense for a common
package repository to have a single instance of this directory or
should each installed python based program retain its own version of
this?
There are some other files that live in site-packages which are not
actually packages. The list so far is:
__pycache__
#some dynamic libraries, like
kiwisolver.cpython-36m-x86_64-linux-gnu.so
#some pth files, but always so far with an explicit version number, like
sphinxcontrib_applehelp-1.0.2-py3.8-nspkg.pth
#or associated with a package with a version number like:
setuptools
setuptools-46.1.3.dist-info
setuptools.pth
#some py files, apparently when that package does not make a corresponding
#directory like:
zipp-3.1.0.dist-info
zipp.py
#initialization file "site" as
site.py
site.pyc
Any others to look out for? That is, files which might be installed
in site-packages but which should not be shared.
Hopefully this next is an appropriate question for this list, since
the issue arises from how python loads packages. Is there any way to
avoid collisions between python based programs other than activating
and deactivating their virtualenvs, or redefining PYTHONPATH, before
each is used? For programs whose library loading is determinate
(usually the case with C, fortran, bash scripts, etc.) one can
construct a bash script (for instance) which
runs 3 programs in order like so:
prog1
prog2
prog3 # spawns subprocesses which run prog2 and prog1
and there are not generally any issues. (Yes, one can create a mess
with LD_PRELOAD and the like.) But if those 3 are python programs
unless prog1, prog2, prog3 are all built into the same virtualenv,
which usually means they come from the same software distribution, I
don't see how to avoid conflicts for the first two cases without
activating/deactivating each one, which looks like it might be tricky
in the 3rd case.
If one has a directory like:
TOP/bin/prog
TOP/lib/python3.6/site-packages
Other than using PYTHONPATH to direct to it with an absolute path, is
there any way to force prog to only import from that specific
site-packages? Let me try that again. Is there a way to tell prog
via any environmental variable to look in
"../lib/python3.6/site-packages" (and nowhere else) for imports, with
the reference directory being that where prog is installed, not where
the process PWD might happen to be. Because if that was possible it
might allow a sort of "set it and forget it" method like
export PYTHONRELPATHFROMPROG="../lib/python3.6/site-packages"
prog1 #uses prog1 site-package
prog2 #uses prog2 site-package
prog3 #uses prog3 site-package
# prog1 subprocess #uses prog1 site-package
# prog2 subprocess #uses prog2 site-package
(None of which would be necessary if python programs could import
specific versions reliably from a common directory containing multiple
versions of each package.)
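For illustration, the relative-path idea can be sketched as a stub placed at the top of each program (a hedged sketch; restrict_sys_path is a made-up helper, and the layout is the TOP/bin, TOP/lib example above):

```python
# Resolve a private site-packages relative to the program file itself
# (not the process PWD), drop every other site-packages entry from
# sys.path, and keep the stdlib paths so normal imports still work.
import os
import sys

def restrict_sys_path(prog_path, rel="../lib/python3.6/site-packages"):
    here = os.path.dirname(os.path.abspath(prog_path))
    private_sp = os.path.normpath(os.path.join(here, rel))
    sys.path = [private_sp] + [p for p in sys.path
                               if "site-packages" not in p]
    return private_sp

# At the top of TOP/bin/prog one would call: restrict_sys_path(__file__)
```

Because the stub keys off the program's own file, spawned subprocesses that run other stubbed programs each get their own site-packages, with no environment variable juggling between them.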
Thanks,
David Mathog
On Fri, Jun 26, 2020 at 12:43 PM David Mathog
So by what method could code working outside of python possibly determine that "yaml" goes with "PyYAML"?
Sorry, I forgot that the information was in PyYAML-5.3.1-py3.6.egg-info/top_level.txt. Still, how common is that? Can anybody offer an estimate of what fraction of packages use different names like that? Thanks, David Mathog
(Sending to the list this time.)
On 2020 Jun 26, at 15:43, David Mathog
So by what method could code working outside of python possibly determine that "yaml" goes with "PyYAML"?
By checking all *.dist-info/RECORD files to see which one mentions the "yaml" directory. (top_level.txt could also be checked, but I believe that only setuptools creates this file — projects built with flit or poetry don't have it — and it's not very helpful when namespace packages are involved.)
Is this a common situation?
It happens whenever the project "foo" distributes a module named something other than "foo". Other projects like this that I can think of off the top of my head are BeautifulSoup4 (module: bs4), python-dateutil (module: dateutil), and attrs (module: attr).
Is pkg_resources actually a package?
pkg_resources is a module distributed by the setuptools project (alongside the modules "setuptools" and "easy_install").
Does it make sense for a common package repository to have a single instance of this directory or should each installed python based program retain its own version of this?
There should be one instance per each version of setuptools stored in the repository. -- John Wodder
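The RECORD scan suggested above might look like this (a sketch; module_to_project is an illustrative name, and real RECORD files have hash and size columns, which the csv reader tolerates):

```python
# Map an importable top-level name (e.g. "yaml") back to the project that
# installed it, by scanning *.dist-info/RECORD files in a site-packages dir.
import csv
import pathlib

def module_to_project(site_packages, module):
    for dist_info in pathlib.Path(site_packages).glob("*.dist-info"):
        record = dist_info / "RECORD"
        if not record.is_file():
            continue
        with record.open(newline="") as fh:
            for row in csv.reader(fh):
                if not row:
                    continue
                top = row[0].split("/")[0]
                if top == module or top == module + ".py":
                    # "PyYAML-5.3.1.dist-info" -> "PyYAML"
                    return dist_info.name[:-len(".dist-info")].rsplit("-", 1)[0]
    return None
```

This only covers dist-info installs; the older egg-info layouts discussed later in the thread would need separate handling.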
On 2020 Jun 26, at 15:50, David Mathog
Still, how common is that? Can anybody offer an estimate about what fraction of packages use different names like that?
Scanning through the wheelodex.org database (specifically, a dump from earlier this week) finds 32,517 projects where the wheel DOES NOT contain a top-level module of the same name as the project (after correcting for differences in case and hyphen vs. underscore vs. period) and 74,073 projects where the wheel DOES contain a module of the same name. (5,417 projects containing no modules were excluded.) Note that a project named "foo-bar" containing a namespace package "foo/bar" is counted in the former group. Of the 32,517 non-matching projects, 7,117 were Odoo projects with project names of the form "odoo{version}_addon_{foo}" containing namespace modules of the form "odoo/addons/{foo}", and 3,175 were Django projects with project names of the form "django_{foo}" containing packages named just "{foo}". No other major patterns seem to stand out. -- John Wodder
Thanks for that feedback. Looks like RECORD is the one to use.
The names of the directories ending in dist-info seem to be uniformly:
package-version.dist-info
but the directory names associated with eggs come in a lot of flavors:
anndata-0.6.19-py3.6.egg
cutadapt-2.10.dev20+g93fb340-py3.6-linux-x86_64.egg
scanpy-1.5.2.dev7+ge33a2f33-py3.6.egg
h5py-2.9.0-py3.6-linux-x86_64.egg
simplejson-3.17.0-py3.6.egg-info
johnnydep does not give any hints that this is coming:
johnnydep --output-format pinned h5py
#relevant part: h5py==2.10.0
What would be some small examples for other package managers, I would
like to see what they have as equivalents to dist-info and egg-info so
that the script does not choke on it.
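For the name flavors listed above, a parser might look like this (a sketch; real-world names can be messier, so treat the regex as a starting point rather than a spec):

```python
# Split site-packages metadata directory names into (project, version),
# covering X.dist-info, X.egg-info, and the multi-part .egg variants.
import re

META_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>[^-]+)"   # project-version
    r"(?:-py\d+\.\d+)?"                    # optional python tag
    r"(?:-[^-]+-[^-]+)?"                   # optional platform tag (e.g. linux-x86_64)
    r"\.(?:dist-info|egg-info|egg)$")

def parse_meta_dir(dirname):
    m = META_RE.match(dirname)
    return (m.group("name"), m.group("version")) if m else None
```

It handles the examples quoted earlier (anndata, cutadapt with its local version segment, h5py, simplejson) and returns None for non-metadata entries like __pycache__.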
Some progress with the test script. It can now convert a virtualenv
to a regular directory
and migrate the site-packages contents to a shared area. A second
migration of a copy of the same virtualenv to a different regular
directory correctly makes links to the first set.
(That is, two normal directories both linked to one common set of
packages.) And the test program (johnnydep) runs in both with
PYTHONPATH set correctly. But preinstalling, that is setting links to
the common directory before doing a normal install is tricky because
of the name inconsistencies. To do that it must run johnnydep to get
the necessary information, and that is not very fast. A normal
install of johnnydep itself, complete with downloads, takes less time
than that program's own analysis!
time johnnydep johnnydep
#21s
vs.
rm -rf ~/.cache/pip #force actual downloads
#too fast to measure
time python3 -m venv johnnydep
#2.3s
source johnnydep/bin/activate
#too fast to measure
time python -m pip install -U pip #update 9.0.3 to 20.1.1
#3.4s
time pip3 install johnnydep
#7.8s
Probably a package with a huge amount of compilation would be a win
for a preinstall, but at this point it is definitely not an "always
faster" option.
Thanks,
David Mathog
On Sat, 27 Jun 2020 at 01:37, David Mathog
Thanks for that feedback. Looks like RECORD is the one to use.
The names of the directories ending in dist-info seem to be uniformly:
package-version.dist-info
Note that if you're doing something like this, you should probably read PEP 376 (https://www.python.org/dev/peps/pep-0376/) which defines the standard layout of installed packages.
but the directory names associated with eggs come in a lot of flavors:
anndata-0.6.19-py3.6.egg
cutadapt-2.10.dev20+g93fb340-py3.6-linux-x86_64.egg
scanpy-1.5.2.dev7+ge33a2f33-py3.6.egg
h5py-2.9.0-py3.6-linux-x86_64.egg
simplejson-3.17.0-py3.6.egg-info
The egg format is an older format that was never standardised. .egg-info directories are the older equivalent of dist-info directories, but egg directories are a very different format (they contain the full distribution plus metadata in one directory). You'd have to find the setuptools documentation of the egg format for the details. (Note that the egg format is obsolete, so you may need to look at older documentation - I don't know if the current setuptools docs describe the format.) I'm not aware of what other formats tools like conda use, sorry.
Paul
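For what it's worth, the RECORD file standardized by PEP 376 can be read programmatically, without parsing directory names at all, via the stdlib importlib.metadata module (Python 3.8+). A minimal sketch; the helper name here is mine, not part of any standard:

```python
# List the files that RECORD claims for an installed project (PEP 376).
# Some egg-style installs have no RECORD, in which case dist.files is
# None and we return an empty list.
from importlib import metadata

def files_owned_by(project):
    """Files recorded for an installed project; [] if it has no RECORD."""
    dist = metadata.distribution(project)
    return [str(f) for f in (dist.files or [])]

if __name__ == "__main__":
    # enumerate everything importlib.metadata can see on sys.path
    for d in metadata.distributions():
        print(d.metadata["Name"])
```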
On Fri, Jun 26, 2020 at 2:51 PM John Thorvald Wodder II
Of the 32,517 non-matching projects, 7,117 were Odoo projects with project names of the form "odoo{version}_addon_{foo}" containing namespace modules of the form "odoo/addons/{foo}", and 3,175 were Django projects with project names of the form "django_{foo}" containing packages named just "{foo}". No other major patterns seem to stand out.
In CentOS 8 the RPM python3-rhnlib-2.8.6-8.module_el8.1.0+211+ad6c0bc7.noarch has loaded into the directory /usr/lib/python3.6/site-packages two entries:

rhn                              # a directory
rhnlib-2.8.6-py3.6.egg-info      # a file

The latter contains just this text:

Metadata-Version: 1.0
Name: rhnlib
Version: 2.8.6
Summary: Python libraries for the Spacewalk project
Home-page: http://rhn.redhat.com
Author: Mihai Ibanescu
Author-email: misa@redhat.com
License: GPL
Description: rhnlib is a collection of python modules used by the Spacewalk (http://spacewalk.redhat.com) software.
Platform: UNKNOWN

Nor is there a link in the other direction:

grep -iR rhnlib /usr/lib/python3.6/site-packages/rhn   # nothing

So while "rhn" bears a similarity to "rhnlib" it is neither the package name nor is it listed in the egg-info. This was of course installed by dnf (AKA yum) and not by a Python installer. Is it possible for any python installer (as opposed to dnf, which runs outside of it) to install an unreferenced directory like this? Presumably not with a dist-info, but with an egg-info that does not in any way reference the active part of the installation?

In a small collection (172 packages) here these were the only two "file" egg-info entries found, with their associated directories:

busco    BUSCO-4.0.6-py3.6.egg-info
ngs      ngs-1.0-py3.6.egg-info

In neither case does the egg-info file reference the corresponding directory, but at least the directory in both has the expected package name (other than case). In the examples you cited at the top, were any of those "different name" cases from packages with a "file" egg-info?

Thanks,
David Mathog
On 2020 Jun 29, at 16:09, David Mathog
In neither case does the egg-info file reference the corresponding directory, but at least the directory in both has the expected package name (other than case). In the examples you cited at the top, were any of those "different name" cases from packages with a "file" egg-info?
The projects I examined were all in wheel form and thus had *.dist-info directories instead of *.egg-info. I know very little about how eggs work, other than that they're deprecated and should be avoided in favor of wheels. -- John Wodder
Hi all.
"Python devirtualizer" is a preliminary implementation which manages
shared packages so that only one copy of each package version is
required. It installs into a virtualenv, then migrates the contents
out into the normal OS environment, and while so doing, replaces what
would be duplicate files with soft links to a single copy. It is
downloadable from here:
https://sourceforge.net/projects/python-devirtualizer/
It is Linux (or other POSIX-like system, __maybe__ Mac) specific. It
will not run on Windows at this point because the main script is bash
and the paths assume POSIX syntax. (It might work in Mingw64 though.)
Anyway,
pdvctrl install packageA
pdvctrl migrate packageA /wherever/packageA
pdvctrl install packageB
pdvctrl migrate packageB /wherever/packageB
will result in a single copy of the shared dependencies on this
system, with both packageA and packageB hooked to them with soft
links. Imports do not go awry because each package's site-packages
directory contains only links to the files that package needs, so the
interpreter never sees conflicting package versions.
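I have not described pdvctrl's internals here, so the following is only a hedged sketch of the deduplicate-then-symlink idea behind migrate (the function name and store layout are mine): any file whose content already exists in the common store is replaced by a soft link to the single stored copy.

```python
# Replace duplicate files under pkg_dir with symlinks into store_dir,
# keying on a content hash plus the file name.  POSIX-only (symlinks).
import hashlib
import os

def dedup_into_store(pkg_dir, store_dir):
    os.makedirs(store_dir, exist_ok=True)
    for root, _dirs, files in os.walk(pkg_dir):
        for name in files:
            src = os.path.join(root, name)
            with open(src, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            kept = os.path.join(store_dir, digest + "-" + name)
            if os.path.exists(kept):
                os.remove(src)          # duplicate: drop this copy
            else:
                os.replace(src, kept)   # first copy moves into the store
            os.symlink(kept, src)       # point the package at the store
```

Running this over two freshly installed trees leaves one real copy of each shared file and two sets of links, which is the disk layout the migrate step aims for.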
There is also:
pdvctrl preinstall packageC
pdvctrl install packageC
pdvctrl migrate packageC /wherever/packageC
which first uses johnnydep to look up dependencies already on the
system and links those in directly before going on to install any
pieces not so installed. Unfortunately the johnnydep runs with
"preinstall" have so far been significantly slower than just doing a
normal install and letting the migrate throw out the extra copy. On
the other hand, the one package I have encountered which has
conflicting requirements (scanpy-scripts) fails in a more
comprehensible manner with "preinstall" than with "install".
Migrate "wraps" the files in the package's "bin" directory, if any, so
that they may be invoked solely by PATH like a regular program. This
uses libSDL2 to get the absolute path of the wrapper program, and it
defines PYTHONPATH before execve() to the actual target. So no
messing about with PYTHONPATH in the user's shell or in scripts. So
far I have not run into a problem with the wrappers, which essentially
just inject a PYTHONPATH into the environment when the program is run.
Well, one package (busco) had a file with no terminal EOL, which
resulted in its last line being dropped while it was being wrapped,
but that case is now handled. I do expect, though, at some point to
encounter a package with several files in its bin, where
first_program contains some variant of:
python3 /wherever/bin/second_program
The wrapper will break those invocations, since the wrapper is a
regular binary and not a python script.
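For comparison, the effect of such a wrapper (the real one is a compiled binary that uses libSDL2 to find its own path) can be mimicked in a few lines of Python; wrapped_env and run_wrapped are my names for illustration only:

```python
# Build the child environment with the package's private site-packages
# prepended to PYTHONPATH, then exec the real program in its place.
import os

def wrapped_env(site_packages, environ=None):
    env = dict(os.environ if environ is None else environ)
    prior = env.get("PYTHONPATH")
    env["PYTHONPATH"] = site_packages + (os.pathsep + prior if prior else "")
    return env

def run_wrapped(target, site_packages, argv):
    # execve never returns: this wrapper process becomes the target
    os.execve(target, [target] + list(argv), wrapped_env(site_packages))
```

The key point is only that PYTHONPATH is injected at exec time, so the user's shell and scripts never need it set.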
Regards,
David Mathog
participants (6)
- David Mathog
- Filipe Laíns
- John Thorvald Wodder II
- Paul Moore
- Steve Dower
- Thomas Kluyver