Performance implications of having large numbers of eggs?

Has anyone done any investigation into the performance implications of having large numbers of eggs installed? Is there any sort of performance hit?

It seems to me that having a really large path might slow down imports a bit, though I suspect this is in C code so probably not a significant problem. It also seems like there might be some startup penalties due to the overhead of setting up the path when using eggs, but this is a one-time cost during python startup, so probably not too bad either.

I'm asking because we're in the process of switching our open-source Enthought Tool Suite library to a distribution of components via eggs and we're having some internal debate as to whether we need to minimize the number of eggs or not. It definitely seems nice to have smaller subsets of functionality -- from the point of view of being able to make things stable, managing their APIs, managing cross-component dependencies, and from the user update size viewpoint. But are we paying a performance penalty for going too small in scope with our eggs?

-- Dave

Date: Thu, 28 Jun 2007 10:47:43 -0500 From: Dave Peterson <dpeterson@enthought.com>
Has anyone done any investigation into the performance implications of having large numbers of eggs installed? Is there any sort of performance hit?
It seems to me that having a really large path might slow down imports a bit, though I suspect this is in C code so probably not a significant problem. It also seems like there might be some startup penalties due to the overhead of setting up the path when using eggs, but this is a one-time cost during python startup, so probably not too bad either.
Another option to avoid a startup penalty is to install all eggs with -m (so they are not listed in the easy-install.pth file, i.e. "deactivated") and have code require() the specific dependency. This has the obvious disadvantage of having to change a fair amount of code. The advantages are that only the eggs that are needed get added to the path, and only when they're needed (instead of every egg in every PYTHONPATH dir), and that your code can be sure it's using a version it's compatible with.
I should mention that I don't have any metrics on the startup penalty, so a change like this may not be worth it if you're only trying to improve that.
I'm asking because we're in the process of switching our open-source Enthought Tool Suite library to a distribution of components via eggs and we're having some internal debate as to whether we need to minimize the number of eggs or not. It definitely seems nice to have smaller subsets of functionality -- from the point of view of being able to make things stable, managing their APIs, managing cross-component dependencies, and from the user update size viewpoint. But are we paying a performance penalty for going too small in scope with our eggs?
-- Dave
-- Rick Ratzel - Enthought, Inc. 515 Congress Avenue, Suite 2100 - Austin, Texas 78701 512-536-1057 x229 - Fax: 512-536-1059 http://www.enthought.com
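
The require() approach described above can be sketched roughly as follows. This is only an illustration, assuming setuptools' pkg_resources module is importable in the running interpreter; the helper name activate() and the requirement string are hypothetical and are not taken from the Enthought code base.

```python
def activate(requirement):
    """Ask pkg_resources to put a matching distribution on sys.path.

    Returns the locations of the activated distributions, or None when
    pkg_resources is not installed or the requirement cannot be met.
    """
    try:
        import pkg_resources  # shipped with setuptools; may be absent
    except ImportError:
        return None
    try:
        # require() resolves the requirement (and its dependencies) and
        # adds the chosen distributions to sys.path on demand.
        dists = pkg_resources.require(requirement)
    except pkg_resources.ResolutionError:
        # Covers DistributionNotFound and VersionConflict.
        return None
    return [dist.location for dist in dists]


# Hypothetical requirement, used here only to show the call shape.
locations = activate("hypothetical-ets-component>=2.0")
print(locations)
```

With eggs installed via -m (multi-version mode), a call like this at the top of a script is what selects and activates the wanted version; nothing is added to sys.path for eggs that are never required.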

On Jun 28, 2007, at 12:05 PM, Rick Ratzel wrote:
Another option to avoid a startup penalty is to install all eggs with -m (so they are not listed in the easy-install.pth file, i.e. "deactivated") and have code require() the specific dependency. This has the obvious disadvantage of having to change a fair amount of code. The advantages are that only the eggs that are needed get added to the path, and only when they're needed (instead of every egg in every PYTHONPATH dir), and that your code can be sure it's using a version it's compatible with.
I should mention that I don't have any metrics on the startup penalty, so a change like this may not be worth it if you're only trying to improve that.
Note that buildout takes this a step further by determining what eggs are needed at install time, rather than run time. I imagine that this could speed startup further, but I don't have any metrics either. :) Of course, you could use the same approach without using buildout.

Note that sooner or later, I'm pretty sure we're going to need a more clever algorithm, likely with some sort of backtracking, to determine working sets. At that point, it will become very important to not do this at run time.

Jim

-- Jim Fulton mailto:jim@zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org

At 11:56 AM 7/2/2007 -0400, Jim Fulton wrote:
Note that sooner or later, I'm pretty sure we're going to need a more clever algorithm, likely with some sort of backtracking, to determine working sets. At that point, it will become very important to not do this at run time.
pkg_resources already backtracks if it needs to, although only once (when pkg_resources is imported) and only if the start script's requirements conflict with packages that are activated by default.

At 10:47 AM 6/28/2007 -0500, Dave Peterson wrote:
Has anyone done any investigation into the performance implications of having large numbers of eggs installed? Is there any sort of performance hit?
It seems to me that having a really large path might slow down imports a bit, though I suspect this is in C code so probably not a significant problem.
If the eggs are zipped, the performance overhead for imports is negligible, although there is a small startup cost to read the zipfile indexes. Python caches zipfile indexes in memory, so checking whether a module is present is just a dictionary lookup and is much faster than having a directory on sys.path. If the eggs are *not* zipped, however, the performance impact on individual imports is much higher. That's why easy_install installs eggs zipped by default.
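
As a rough illustration of the zipped case: a zipped egg is a zip archive placed on sys.path, and Python's zipimport machinery serves imports from it. The sketch below builds a throwaway archive and imports a module from it. The archive name, module name and helper function are all made up for the example, and it does not measure anything.

```python
import importlib
import os
import sys
import tempfile
import zipfile
import zipimport


def import_from_zip(module_name, source):
    """Write `source` into a fresh zip archive and import it from there."""
    tmpdir = tempfile.mkdtemp()
    archive = os.path.join(tmpdir, "demo_egg.zip")  # hypothetical name
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(module_name + ".py", source)
    # Putting the archive itself on sys.path is what makes it importable.
    sys.path.insert(0, archive)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module(module_name)
    finally:
        sys.path.remove(archive)
    return module, archive


module, archive = import_from_zip("egg_demo_mod", "VALUE = 42\n")
# The loader is a zipimporter, i.e. the module came from the archive.
print(isinstance(module.__spec__.loader, zipimport.zipimporter))
print(module.__file__)
```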
It also seems like there might be some startup penalties due to the overhead of setting up the path when using eggs, but this is a one-time cost during python startup, so probably not too bad either.
I'm asking because we're in the process of switching our open-source Enthought Tool Suite library to a distribution of components via eggs and we're having some internal debate as to whether we need to minimize the number of eggs or not. It definitely seems nice to have smaller subsets of functionality -- from the point of view of being able to make things stable, managing their APIs, managing cross-component dependencies, and from the user update size viewpoint. But are we paying a performance penalty for going too small in scope with our eggs?
I suggest you measure what you're concerned about. At one point, I did some timing tests that suggested that if you put the entire Cheeseshop on sys.path as zipped eggs, you might increase Python's startup time by a second or two. But your mileage may vary, and the Cheeseshop has increased a lot in size since then. ;)

By the way, the long term plan for setuptools is that it should be able to install things the "old fashioned" way and still be able to manage them, using an installation manifest inside the .egg-info directories. In that way, you'd have all the benefits of separate distribution and a managed installation, as well as the benefits of having only one directory on sys.path. But I haven't done any work on implementing this yet.

(Actually, now that I think of it, somebody could probably create a zc.buildout recipe to install eggs in an unpacked fashion (and let buildout handle uninstallation). The tricky part would be namespace package __init__ modules, since multiple eggs can share responsibility for the __init__, and this might confuse zc.buildout.)
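
One way to take such a measurement is sketched below. It is not the timing test mentioned above. It assumes that placeholder zip archives listed in PYTHONPATH are a fair stand-in for real eggs activated through easy-install.pth, and the function name, file names and counts are arbitrary choices for the example.

```python
import os
import subprocess
import sys
import tempfile
import time
import zipfile


def startup_seconds(num_zips, repeats=3):
    """Best-of-`repeats` wall time to start Python and import json
    with `num_zips` placeholder zip archives ahead of the stdlib."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = []
        for i in range(num_zips):
            path = os.path.join(tmpdir, "pkg%d.egg" % i)
            with zipfile.ZipFile(path, "w") as zf:
                zf.writestr("placeholder%d.py" % i, "")
            paths.append(path)
        env = dict(os.environ)
        if paths:
            env["PYTHONPATH"] = os.pathsep.join(paths)
        else:
            env.pop("PYTHONPATH", None)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            # Importing a non-builtin stdlib module forces the import
            # system to consult every earlier sys.path entry first.
            subprocess.run([sys.executable, "-c", "import json"],
                           env=env, check=True)
            best = min(best, time.perf_counter() - start)
        return best


baseline = startup_seconds(0)
loaded = startup_seconds(100)
print("baseline: %.3fs, with 100 zips: %.3fs" % (baseline, loaded))
```

The numbers depend heavily on the machine, the filesystem cache and the Python version, so they are only meaningful as a comparison taken on the system you actually care about.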
participants (4)
- Dave Peterson
- Jim Fulton
- Phillip J. Eby
- Rick Ratzel