Experience of setuptools' cache design
Hello.

Being nothing but an innocent bystander, I have not thoroughly searched the archives of this list, and might be writing about something that has long been "old news". If so, I apologize.

This is a small story about what I, being an end user who did not know what "setuptools" was, experienced when I installed the latest MySQLdb module (and I am aware that there might very well be issues with how it was installed). My hope is that this is useful to the developers of setuptools in determining whether they meet the design criteria they have set up.

A colleague of mine installed the latest MySQLdb module, which uses eggs and setuptools. We use Python and MySQL heavily on our mailserver. In that environment, user names and home directories do not always match, and indeed most users running the programs are system users and do not have home directories at all.

This turned out to be a problem with the setuptools-based MySQLdb module. The reason is the .python-eggs directory.

Our configuration daemon runs as a system user "graald", a user without a home directory (that user does not have a valid shell either). After updating MySQLdb to the setuptools-based version, the Python server code no longer started, because setuptools tried to write to the .python-eggs directory in the user's $HOME. And $HOME is not a valid directory (or is, often, the home directory of the person starting the script rather than of the account it runs as).

Ok, so we can set the PYTHON_EGG_CACHE environment variable, which solves the problem (but in an unclean way).

Then comes the Python script that runs as "virtmail" and tries to connect to MySQL, "virtmail" being another user without a home directory. And of course it does not start, and cannot have the same PYTHON_EGG_CACHE directory, since both these users cannot write to the same directory. And, not having a home directory, the "virtmail" user does not have a .bashrc or similar where we can set that user's own cache directory.

We cannot set PYTHON_EGG_CACHE in the script itself either, as we cannot know what to set it to: if two different system users run the same script, they need different cache directories. Uh-oh. We need to implement code to choose the cache directory depending on the user that's running.

And when I tried to run a script under my own user, with a proper home directory, nothing worked as expected either. The reason? I had previously run a Python script as root (but without changing my $HOME at that time). So now I have a .python-eggs in my own home directory owned by root:root.

My thoughts (perhaps useful, perhaps not):

* Primarily, I think it is unfortunate that an "import foo" starts creating files in the file system - it is not what I personally expect from doing an "import"!

* On our user system, with some 20.000 active users, there will be up to 20.000 copies of a .python-eggs directory if someone installs a program that uses a Python Egg (but does not have access to site-packages or does not know how to detect what eggs are used and how to install them there).

* I, personally, think it would be better if I explicitly have to _request_ a per-user cache directory being made, rather than needing to implement solutions to _prevent_ that from happening.

* If the default is to remain to create files on "import", I would like error checking and fallbacks. If the cache directory cannot be created in $HOME, I would like the code to create it somewhere else (or not at all) instead of giving me an exception. As an end user I did not request the cache directory to be made, and therefore do not want to be given an exception caused by it not being created. Especially as I do not know what to do with such an exception. Perhaps creating it in /tmp/python-eggs-$USERNAME, for example.

Thanks for listening.

/Viktor
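The per-user cache selection and the /tmp fallback described in this message could be sketched roughly as below. This is an illustration only, not code from setuptools: the helper names choose_egg_cache, usable and current_user_name are made up for this sketch, and it assumes a Unix system where the pwd module is available. The only name taken from the message is the PYTHON_EGG_CACHE environment variable.

```python
import os
import pwd
import tempfile


def current_user_name():
    """Name of the effective user, from the passwd database rather than $USER."""
    uid = os.geteuid()
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to a numeric label.
        return "uid%d" % uid


def usable(path):
    """Create path if missing (mode 0700) and check that we own it and can write to it."""
    try:
        os.makedirs(path, mode=0o700, exist_ok=True)
        # Creating a scratch file proves the directory is writable right now.
        with tempfile.TemporaryFile(dir=path):
            pass
        return os.stat(path).st_uid == os.geteuid()
    except OSError:
        return False


def choose_egg_cache(home, tmp_root=None):
    """Return a usable per-user cache directory, or None if none can be made.

    Tries <home>/.python-eggs first, then <tmp_root>/python-eggs-<user>.
    """
    candidates = []
    if home:
        candidates.append(os.path.join(home, ".python-eggs"))
    candidates.append(os.path.join(tmp_root or tempfile.gettempdir(),
                                   "python-eggs-" + current_user_name()))
    for path in candidates:
        if usable(path):
            return path
    return None
```

A script would call choose_egg_cache(os.environ.get("HOME")) before its first egg-based import and, if the result is not None, assign it to os.environ["PYTHON_EGG_CACHE"]. The ownership check is there so that a root-owned .python-eggs in a home directory, or a directory in /tmp created in advance by someone else, is skipped rather than used.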
At 11:34 AM 1/18/2008 +0100, Viktor Fougstedt wrote:
My thoughts (perhaps useful, perhaps not):
* Primarily, I think it is unfortunate that an "import foo" starts creating files in the file system - it is not what I personally expect from doing an "import"!
Note that normal imports will also create .pyc or .pyo files alongside the source, if a valid compiled version of the source isn't available.
* On our user system, with some 20.000 active users, there will be up to 20.000 copies of a .python-eggs directory if someone installs a program that uses a Python Egg (but does not have access to site-packages or does not know how to detect what eggs are used and how to install them there).
If you want eggs to always be installed unzipped, you can add an --always-unzip option to your site-wide distutils.cfg. Then, by default this will unzip an egg at installation time rather than extracting libraries at run time.
* I, personally, think it would be better if I explicitly have to _request_ a per-user cache directory being made, rather than needing to implement solutions to _prevent_ that from happening.
* If the default is to remain to create files on "import", I would like error checking and fallbacks. If the cache directory cannot be created in $HOME, I would like the code to create it somewhere else (or not at all) instead of giving me an exception. As an end user I did not request the cache directory to be made, and therefore do not want to be given an exception caused by it not being created. Especially as I do not know what to do with such an exception. Perhaps creating it in /tmp/python-eggs-$USERNAME, for example.
That seems like a reasonable fallback, and I'll take a look at implementing it in a future release. Thanks for the idea!
On Jan 18, 2008, at 9:49 AM, Phillip J. Eby wrote:
At 11:34 AM 1/18/2008 +0100, Viktor Fougstedt wrote:
My thoughts (perhaps useful, perhaps not):
* Primarily, I think it is unfortunate that an "import foo" starts creating files in the file system - it is not what I personally expect from doing an "import"!
Note that normal imports will also create .pyc or .pyo files alongside the source, if a valid compiled version of the source isn't available.
Only if the process has write access to the directory. No error occurs if the pyc files can't be written.

Jim

-- 
Jim Fulton
Zope Corporation
On Fri, 2008-01-18 at 09:49 -0500, Phillip J. Eby wrote:
* On our user system, with some 20.000 active users, there will be up to 20.000 copies of a .python-eggs directory if someone installs a program that uses a Python Egg (but does not have access to site-packages or does not know how to detect what eggs are used and how to install them there).
If you want eggs to always be installed unzipped, you can add an --always-unzip option to your site-wide distutils.cfg. Then, by default this will unzip an egg at installation time rather than extracting libraries at run time.
With my distutils (distutils.__version__ == '2.5.1') I had to use this:

    [easy_install]
    zip_ok = False

--always-unzip was not recognized.

-- 
Lloyd Kvam
Venix Corp
DLSLUG/GNHLUG library
http://www.librarything.com/catalog/dlslug
http://www.librarything.com/profile/dlslug
http://www.librarything.com/rsshtml/recent/dlslug
On Jan 18, 2008, at 9:49 AM, Phillip J. Eby wrote: ...
* If the default is to remain to create files on "import", I would like error checking and fallbacks. If the cache directory cannot be created in $HOME, I would like the code to create it somewhere else (or not at all) instead of giving me an exception. As an end user I did not request the cache directory to be made, and therefore do not want to be given an exception caused by it not being created. Especially as I do not know what to do with such an exception. Perhaps creating it in /tmp/python-eggs-$USERNAME, for example.
That seems like a reasonable fallback, and I'll take a look at implementing it in a future release. Thanks for the idea!
I'm not sure which of the ideas above you are referring to, but I think implicitly creating files at run time outside the installed package is a bad idea. I would want some way to prevent it, especially in a production environment.

Jim

-- 
Jim Fulton
Zope Corporation
At 10:13 AM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 9:49 AM, Phillip J. Eby wrote: ...
* If the default is to remain to create files on "import", I would like error checking and fallbacks. If the cache directory cannot be created in $HOME, I would like the code to create it somewhere else (or not at all) instead of giving me an exception. As an end user I did not request the cache directory to be made, and therefore do not want to be given an exception caused by it not being created. Especially as I do not know what to do with such an exception. Perhaps creating it in /tmp/python-eggs-$USERNAME, for example.
That seems like a reasonable fallback, and I'll take a look at implementing it in a future release. Thanks for the idea!
I'm not sure which of the ideas above you are referring to but I think implicitly creating files at run time outside the installed package is a bad idea.
I'm referring to the idea of using a temporary directory as a cache.
I would want some way to prevent it, especially in a production environment.
The only way to absolutely prevent it is to install things unzipped. And if you're going to do that, you might as well use .egg-info style eggs, since that will give better runtime performance. That is, for an egg "foo-1.2-py2.5-platform.egg", you would unzip its contents to the target directory and then rename the extracted EGG-INFO directory to "foo-1.2-py2.5-platform.egg-info".
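The unzip-and-rename step described above might look roughly like this. It is a sketch using only the standard zipfile module, not something setuptools provides; the function name install_egg_flat is invented here, and it assumes the egg is a plain zip file with an EGG-INFO directory at its top level.

```python
import os
import zipfile


def install_egg_flat(egg_path, target_dir):
    """Unzip a zipped egg into target_dir and rename EGG-INFO to '<egg file name>-info'."""
    egg_name = os.path.basename(egg_path)  # e.g. foo-1.2-py2.5-platform.egg
    info_dir = os.path.join(target_dir, egg_name + "-info")
    if os.path.exists(info_dir):
        # Refuse to install the same egg twice into one target.
        raise FileExistsError(info_dir)
    with zipfile.ZipFile(egg_path) as egg:
        # Extracts packages, modules and EGG-INFO side by side in target_dir.
        egg.extractall(target_dir)
    os.rename(os.path.join(target_dir, "EGG-INFO"), info_dir)
    return info_dir
```

For "foo-1.2-py2.5-platform.egg" this leaves the package files directly in the target directory next to "foo-1.2-py2.5-platform.egg-info". Note that extractall silently overwrites files that already exist in the target, so two eggs shipping the same file will collide.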
Jim
-- Jim Fulton Zope Corporation
On Jan 18, 2008, at 12:32 PM, Phillip J. Eby wrote:
At 10:13 AM 1/18/2008 -0500, Jim Fulton wrote: ...
I would want some way to prevent it, especially in a production environment.
The only way to absolutely prevent it is to install things unzipped. And if you're going to do that, you might as well use .egg-info style eggs, since that will give better runtime performance. That is, for an egg "foo-1.2-py2.5-platform.egg", you would unzip its contents to the target directory and then rename the extracted EGG-INFO directory to "foo-1.2-py2.5-platform.egg-info".
Thanks. Can you briefly explain or provide a link to something that explains the performance improvement?

Jim

-- 
Jim Fulton
Zope Corporation
At 12:50 PM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 12:32 PM, Phillip J. Eby wrote:
At 10:13 AM 1/18/2008 -0500, Jim Fulton wrote: ...
I would want some way to prevent it, especially in a production environment.
The only way to absolutely prevent it is to install things unzipped. And if you're going to do that, you might as well use .egg-info style eggs, since that will give better runtime performance. That is, for an egg "foo-1.2-py2.5-platform.egg", you would unzip its contents to the target directory and then rename the extracted EGG-INFO directory to "foo-1.2-py2.5-platform.egg-info".
Thanks.
Can you briefly explain or provide a link to something that explains the performance improvement?
Fewer directories on sys.path = better import performance, compared to individually putting a series of .egg directories on sys.path. This doesn't apply so much for zipped eggs, because the zip directories get cached.

Overall, the tradeoffs are:

zipped .egg files = faster imports + some startup overhead to read the zip directories
.egg directories = slower imports due to lots of stat-ing
.egg-info = "normal" imports + no extra startup overhead

Zipped .egg files have the fastest import times overall, assuming you don't have so many eggs to read that the overhead wipes out your gains. Extracting all eggs to the same directory is second fastest, with .egg directories (i.e. --always-unzip) being the slowest thing you could possibly do.
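One rough way to see the effect of extra sys.path entries is to time a single import of a fresh module that sits behind a number of empty directories. The sketch below is only an illustration of that idea: the function name and the generated module name are made up, and newer Python versions cache directory listings in the import machinery, so the numbers will not match what a 2008-era interpreter would show.

```python
import importlib
import os
import sys
import tempfile
import time


def time_import_with_extra_paths(n_dirs):
    """Time one import of a fresh module placed after n_dirs empty sys.path entries."""
    with tempfile.TemporaryDirectory() as root:
        extra = []
        for i in range(n_dirs):
            d = os.path.join(root, "egg%d" % i)  # stands in for an unzipped .egg directory
            os.mkdir(d)
            extra.append(d)
        mod_dir = os.path.join(root, "real")
        os.mkdir(mod_dir)
        mod_name = "probe_mod_%d" % n_dirs
        with open(os.path.join(mod_dir, mod_name + ".py"), "w") as f:
            f.write("VALUE = 1\n")
        saved = list(sys.path)
        # The empty directories are searched first, the real one last.
        sys.path[:0] = extra + [mod_dir]
        importlib.invalidate_caches()
        try:
            start = time.perf_counter()
            importlib.import_module(mod_name)
            elapsed = time.perf_counter() - start
        finally:
            # Undo all global changes so repeated runs start clean.
            sys.path[:] = saved
            sys.modules.pop(mod_name, None)
        return elapsed
```

Comparing time_import_with_extra_paths(0) with, say, time_import_with_extra_paths(200) over several runs gives a first impression; a real measurement would use the project's actual eggs and many imports.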
On Jan 18, 2008, at 1:04 PM, Phillip J. Eby wrote:
At 12:50 PM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 12:32 PM, Phillip J. Eby wrote:
At 10:13 AM 1/18/2008 -0500, Jim Fulton wrote: ...
I would want some way to prevent it, especially in a production environment.
The only way to absolutely prevent it is to install things unzipped. And if you're going to do that, you might as well use .egg-info style eggs, since that will give better runtime performance. That is, for an egg "foo-1.2-py2.5-platform.egg", you would unzip its contents to the target directory and then rename the extracted EGG-INFO directory to "foo-1.2-py2.5-platform.egg-info".
Thanks.
Can you briefly explain or provide a link to something that explains the performance improvement?
Fewer directories on sys.path = better import performance, compared to individually putting a series of .egg directories on sys.path.
Ah, I didn't understand what you meant by "unzip the contents to the target directory". So the idea is that you'd merge the contents of multiple zip files into a single "target" directory. This seems rather messy to me. It doesn't appear to be compatible with multi-version installs. Also, if a newer egg version removes a file, the removed file will be left installed after an upgrade. If two eggs provide the same file, files will be overridden. Admittedly, this is a somewhat pathological case, but the overriding seems to compound the pathology.
This doesn't apply so much for zipped eggs, because the zip directories get cached. Overall, the tradeoffs are:
zipped .egg files = faster imports + some startup overhead to read the zip directories
.egg directories = slower imports due to lots of stat-ing
.egg-info = "normal" imports + no extra startup overhead
Zipped .egg files have the fastest import times overall, assuming you don't have so many eggs to read that the overhead wipes out your gains. Extracting all eggs to the same directory is second fastest, with .egg directories (i.e. --always-unzip) being the slowest thing you could possibly do.
I wonder what the performance impacts really are. I'm looking forward to measuring this some day for our projects. We've put off making our eggs zip safe so far, as this hasn't been a high priority. :/

As described in a separate thread, I'm going to add an option to buildout so buildout users can explicitly define a directory to use as a setuptools cache. In that case, zip-safe eggs can remain zipped even if they have extensions.

Jim

-- 
Jim Fulton
Zope Corporation
At 01:31 PM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 1:04 PM, Phillip J. Eby wrote:
At 12:50 PM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 12:32 PM, Phillip J. Eby wrote:
At 10:13 AM 1/18/2008 -0500, Jim Fulton wrote: ...
Can you briefly explain or provide a link to something that explains the performance improvement?
Fewer directories on sys.path = better import performance, compared to individually putting a series of .egg directories on sys.path.
Ah, I didn't understand what you meant by "unzip the contents to the target directory".
So the idea is that you'd merge the contents of multiple zip files into a single "target" directory.
Correct.
This seems rather messy to me.
Well, it's what we all used to do before eggs came along, and what most packaging systems still do now. :)
It doesn't appear to be compatible with multi-version installs.
It *can* be compatible, with certain limitations.
Also, if a newer egg version removes a file, the removed file will be left installed after an upgrade. If two eggs provide the same file, files will be overridden. Admittedly, this is a somewhat pathological case, but the overriding seems to compound the pathology.
Again, this is all true without eggs now. Clearly, eggs have spoiled you tremendously. ;-) Seriously, though, if buildout is tracking what files get installed or overwritten during unzipping, you can manage all of this, just as you presumably do for any other sort of installation recipe, no?
As described in a separate thread, I'm going to add an option to buildout so buildout users can explicitly define a directory to use as a setuptools cache. In that case, zip-safe eggs can remain zipped even if they have extensions.
If you want to be really safe/careful, you can pre-extract the eggs to the cache, thereby avoiding any runtime permission problems.
On Jan 18, 2008, at 1:55 PM, Phillip J. Eby wrote: ...
It doesn't appear to be compatible with multi-version installs.
It *can* be compatible, with certain limitations.
I don't see how, unless you mean that only one version is handled this way.
Also, if a newer egg version removes a file, the removed file will be left installed after an upgrade. If two eggs provide the same file, files will be overridden. Admittedly, this is a somewhat pathological case, but the overriding seems to compound the pathology.
Again, this is all true without eggs now. Clearly, eggs have spoiled you tremendously. ;-)
Absolutely. No kidding. Eggs are great!
Seriously, though, if buildout is tracking what files get installed or overwritten during unzipping, you can manage all of this, just as you presumably do for any other sort of installation recipe, no?
It isn't possible if different parts need different versions. If I ignore multi-version requirements, then I could keep track of files installed. In addition, a nice feature of buildouts is that you can share egg directories between buildouts. This is very handy, but makes multi-version support even more important.
As described in a separate thread, I'm going to add an option to buildout so buildout users can explicitly define a directory to use as a setuptools cache. In that case, zip-safe eggs can remain zipped even if they have extensions.
If you want to be really safe/careful, you can pre-extract the eggs to the cache, thereby avoiding any runtime permission problems.
Yup, except I don't see a way to enumerate the resources. After all, resources aren't listed in the egg metadata. I could make an educated guess and extract all of the non-py files. Or I guess I could just extract everything. :) Disk space is pretty cheap, so this is really quite practical IMO.

Jim

-- 
Jim Fulton
Zope Corporation
At 02:08 PM 1/18/2008 -0500, Jim Fulton wrote:
On Jan 18, 2008, at 1:55 PM, Phillip J. Eby wrote: ...
It doesn't appear to be compatible with multi-version installs.
It *can* be compatible, with certain limitations.
I don't see how, unless you mean that only one version is handled this way.
That would be one of the "certain limitations", yes. :) The other is that the version installed this way can't be overridden except by scripts. That is, if you start Python without a script that initializes one of the alternate versions to be on sys.path, then you will get the "single-version" install as the version on sys.path.
On 18 jan 2008, at 19.04, Phillip J. Eby wrote:
At 12:50 PM 1/18/2008 -0500, Jim Fulton wrote:
Can you briefly explain or provide a link to something that explains the performance improvement?
Fewer directories on sys.path = better import performance, compared to individually putting a series of .egg directories on sys.path.
Hi again, and thanks for the quick and interesting responses.

For me as a naive end user, the import performance is not critical on any system I run - my site-packages directories rarely contain more than five to ten packages, and I have lots of CPU and I/O to use (I even believe frequently recurring stat():s are cached by the OS).

I am, personally, more concerned with a packaging system being scalable to many different computer systems, where user-id:s and home directories do not match (and many user-id:s, indeed, do not have home directories), where there are tens of thousands of users, or where there are two or three separate installations of libraries and binaries and the modules need to use the right ones depending on which python interpreter is run. Being scalable to handle thousands of packages in site-packages is not on my personal list of priorities.

It's all about which dimension you want the packaging system to be flexible in, I guess. My little vote goes to the "many users and strange systems" dimension. :-)

But who knows - many CPAN installations actually _have_ hundreds of dependencies installed. And with a packaging system that makes it as easy to install Python packages, perhaps the same will happen for us.

Regards,
/Viktor
On Jan 18, 2008, at 5:34 AM, Viktor Fougstedt wrote: ...
My thoughts (perhaps useful, perhaps not):
* Primarily, I think it is unfortunate that an "import foo" starts creating files in the file system - it is not what I personally expect from doing an "import"!
* On our user system, with some 20.000 active users, there will be up to 20.000 copies of a .python-eggs directory if someone installs a program that uses a Python Egg (but does not have access to site-packages or does not know how to detect what eggs are used and how to install them there).
* I, personally, think it would be better if I explicitly have to _request_ a per-user cache directory being made, rather than needing to implement solutions to _prevent_ that from happening.
* If the default is to remain to create files on "import", I would like error checking and fallbacks. If the cache directory cannot be created in $HOME, I would like the code to create it somewhere else (or not at all) instead of giving me an exception. As an end user I did not request the cache directory to be made, and therefore do not want to be given an exception caused by it not being created. Especially as I do not know what to do with such an exception. Perhaps creating it in /tmp/python-eggs-$USERNAME, for example.
I couldn't agree more. And, AFAICT, everyone I know, except Phillip, agrees that this is a bad feature.

IMO, any package including extensions should be treated as zip-unsafe. I intend to add this policy to buildout. Then the only way to be bitten is to use one of the resource APIs that causes files to be extracted from zipped eggs. I'm happy to consider calls to these APIs to be bugs.

As a workaround, when installing packages that have this problem, use the -Z option to easy_install to cause them to be installed unzipped (or unzip them yourself).

Jim

-- 
Jim Fulton
Zope Corporation
On Jan 18, 2008, at 8:06 AM, Jim Fulton wrote:
I couldn't agree more. And, AFAICT, everyone I know, except Phillip, agrees that this is a bad feature.
For what it is worth, this is one of (several) blockers preventing Twisted from using setuptools:

http://twistedmatrix.com/trac/ticket/2308#comment:6

(Note: I am not a Twisted devel, and even if the issue described in ticket #2308, comment:6 were solved, there are probably still a lot of other reasons why the Twisted devels wouldn't use setuptools.)

Regards,
Zooko
participants (6): Jim Fulton, Lloyd Kvam, Phillip J. Eby, Ronald Oussoren, Viktor Fougstedt, zooko