[Python-Dev] Draft PEP: "Simplified Package Layout and Partitioning"
P.J. Eby
pje at telecommunity.com
Thu Jul 21 15:20:53 CEST 2011
At 11:52 AM 7/21/2011 +1000, Nick Coghlan wrote:
>Trying to change how packages are identified at the Python level makes
>PEP 382 sound positively appealing. __path__ needs to stay :)
In which case, it should be a list, not a sentinel. ;-)
>Even better would be for these (and sys.path) to be list subclasses
>that did the right thing under the hood as Glenn suggested. Code that
>*replaces* rather than modifies these attributes would still
>potentially break virtual packages, but code that modifies them in
>place would do the right thing automatically. (Note that all code that
>manipulates sys.path and __path__ attributes requires explicit calls
>to correctly support current namespace package mechanisms, so this
>would actually be an improvement on the status quo rather than making
>anything worse).
I think the simplest thing, if we're keeping __path__ (and on
reflection, I think we should), would be to simply call
extend_virtual_paths() automatically on new path entries found in
sys.path when an import is performed, relative to the previous value
of sys.path.
That is, we save an "old" copy of sys.path somewhere, and whenever
__import__() is called (well, once it gets past checking if the
target is already in sys.modules, anyway), it checks the current
sys.path against it, and calls extend_virtual_paths() on any sys.path
entries that weren't in the "old" sys.path.
This is not the most efficient thing in the world, as it will cause a
bunch of stat calls to happen against the new directories, in the
middle of a possibly-entirely-unrelated import operation, but it
would certainly address the issue in the Simplest Way That Could Possibly Work.
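A rough sketch of that bookkeeping, for illustration only. The PathWatcher class and new_path_entries helper are hypothetical names, and the extend_virtual_paths callable passed in is assumed to take a single path entry; nothing here is the actual import machinery.

```python
def new_path_entries(old_path, current_path):
    """Return the entries of current_path that were not in old_path, in order."""
    seen = set(old_path)
    added = []
    for entry in current_path:
        if entry not in seen:
            seen.add(entry)
            added.append(entry)
    return added


class PathWatcher:
    """Hypothetical helper: remembers an "old" copy of a path list."""

    def __init__(self, path, extend_virtual_paths):
        self.path = path                    # live list, e.g. sys.path
        self.extend = extend_virtual_paths  # assumed: called with one path entry
        self.old = list(path)               # the saved "old" copy

    def check(self):
        """Would run in __import__ after the sys.modules check (assumption)."""
        added = new_path_entries(self.old, self.path)
        for entry in added:
            self.extend(entry)
        self.old = list(self.path)
        return added
```

Every call to check() pays for the comparison, and each new entry triggers whatever filesystem work extend_virtual_paths does, which is the cost described above.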
A stricter (safer) version of the same thing would be one where we
only update __path__ values that are unchanged since we created them,
and rather than only appending new entries, we replace the __path__
with a newly-computed one.
This version is safer because it avoids corner cases like "I imported
foo.bar while foo.baz 1.1 was on my path, then I prepended a
directory to sys.path that has foo.baz 1.2, but I still get foo.baz
1.1 when I import." But it loses in cases where people do direct
__path__ manipulation.
On the other hand, it's a lot easier to say "you break it, you bought
it" where __path__ manipulation is concerned, so I'm actually pretty
inclined towards using the strict version.
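As a sketch of the strict variant, assuming we keep a private copy of each __path__ we set: refresh_path, the remembered dict, and the compute_path callable are all hypothetical names used only for illustration.

```python
def refresh_path(module, remembered, compute_path):
    """Replace module.__path__ only if nobody has touched it since we set it.

    remembered   -- dict mapping package name to a copy of the __path__ we set
    compute_path -- assumed callable: package name -> freshly computed path list
    """
    name = module.__name__
    if list(module.__path__) != remembered.get(name):
        # Someone manipulated __path__ directly: leave it alone.
        return False
    new_path = list(compute_path(name))
    module.__path__ = new_path
    remembered[name] = list(new_path)  # keep a separate copy for next time
    return True
```

Because the whole value is recomputed rather than appended to, a directory prepended to sys.path ends up first in the refreshed __path__ too.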
Hey... here's a crazy idea. Suppose that a virtual package __path__
is a *tuple* instead of a list? Now, in order to change it, you
*have* to replace it. And we can cache the tuple we initially set it
to in sys.virtual_package_paths, so we can do an 'is' check before
replacing it.
Voila: __path__ still exists and is still a sequence for a virtual
path, but you have to explicitly replace it if you want to do
anything funky -- at which point you're responsible for maintaining it.
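A minimal sketch of the tuple idea, using a module-level dict as a stand-in for the proposed sys.virtual_package_paths. The function names set_virtual_path and update_virtual_path are made up for this example.

```python
# Stand-in for the proposed sys.virtual_package_paths (assumption).
virtual_package_paths = {}


def set_virtual_path(module, entries):
    """Install a tuple __path__ and cache the exact object we installed."""
    path = tuple(entries)
    module.__path__ = path
    virtual_package_paths[module.__name__] = path


def update_virtual_path(module, entries):
    """Replace __path__ only if it is still the very tuple we cached."""
    cached = virtual_package_paths.get(module.__name__)
    if cached is None or module.__path__ is not cached:
        # The owner replaced it, so the owner now maintains it.
        return False
    set_virtual_path(module, entries)
    return True
```

Since a tuple cannot be modified in place, the identity check is enough to detect that someone has taken over the attribute.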
I'm tempted to say, "well, why not use a list-subclass proxy, then?",
but that means more work for no real difference. I just went through
dozens of examples of __path__ usage (found via Google), and I found
exactly two examples of code that modifies a __path__ that is not:
1. In the __init__.py whose __path__ it is (i.e., code that'll still
have a list), or
2. Modifying the __path__ of an explicitly-named self-contained
package that's part of the same distribution.
The two examples are from Twisted and Google AppEngine. In the
Twisted case, it's some sort of namespace package-like plugin
chicanery, and in the AppEngine case, well, I'm not sure what the
heck it's doing, but it seems to be making sure that you can still
import stuff that has the same name as stdlib stuff, or something.
The Twisted case (and an apparent copy of the same code in a project
called "flumotion") uses ihooks, though, so I'm not sure it'll even
get executed for virtual packages. The Google case loops over
everything in sys.modules, in a function by the name of
appengine.dist.fix_paths()... but I wasn't able to find out who
calls this function, when and why.
So, pretty much, except for these bits of "nosy" code, the vast
majority of code out there seems to only mess with its own
self-contained paths, making the use of tuples seem like a pretty safe choice.
(Oh, and all the code I found that reads paths without modifying them
uses only tuple-safe operations.)
So, if we implement automatic __path__ updates for virtual packages,
I'm currently leaning towards the strict approach using tuples, but
could possibly be persuaded towards read-only list-proxies instead.
Side note: it looks like a *lot* of code out there abuses __path__[0]
to find data files, so I probably need to add a note to the PEP about
not doing that when you convert a self-contained package to a virtual
one. Of course, I suppose using a sentinel could address *that*
problem, or an iteration-only proxy.
The main concern here is that using __path__[0] will *seem* to work
when you first use it with a virtual package, because it'll be the
right directory. But it'll be wrong long-term.
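To illustrate the difference, here is one iteration-only way to look for a data file. This is just a sketch of the general idea, not something the PEP prescribes, and find_data_file is a hypothetical helper.

```python
import os


def find_data_file(package_path, filename):
    """Search every directory in a package's __path__, not just the first."""
    for directory in package_path:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate
    return None
```

With a virtual package, the directory holding the data file may be any entry of __path__, so code that hard-wires index 0 keeps working only until another portion of the package lands earlier on the path.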
This seems to lean in favor of making a simple reiterable wrapper
type for the __path__, that only allows you to take the length and
iterate over it. With an appropriate design, it could actually
update itself automatically, given a subname and a parent
__path__/sys.path. That is, it could keep a tuple copy of the
last-seen parent path, and before iteration, compare
tuple(self.parent_path) to self.last_seen_path. If they're
different, it rebuilds the value to be iterated over.
Voila: transparent updating of all virtual __path__ values from
sys.path changes (or modifications to self-contained __path__
parents, btw), and trying to change it (or read an item from it
positionally) will not create any silent failures.
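A minimal sketch of such a wrapper. The class name VirtualPath and the find_subdirs hook are assumptions, and the default rule used here (keep parent entries that contain a directory named after the subname) is a simplification, not the PEP's actual rule.

```python
import os


def _default_find(subname, parent_entries):
    """Simplified rule (assumption): keep parent/subname if it is a directory."""
    found = []
    for entry in parent_entries:
        candidate = os.path.join(entry, subname)
        if os.path.isdir(candidate):
            found.append(candidate)
    return found


class VirtualPath:
    """Reiterable __path__ stand-in: supports only len() and iteration."""

    def __init__(self, subname, parent_path, find_subdirs=_default_find):
        self.subname = subname
        self.parent_path = parent_path  # live object: sys.path or a parent __path__
        self.last_seen_path = None      # tuple copy of the parent at last rebuild
        self._entries = ()
        self._find = find_subdirs

    def _refresh(self):
        current = tuple(self.parent_path)
        if current != self.last_seen_path:
            # Parent changed since we last looked: rebuild our value.
            self._entries = tuple(self._find(self.subname, current))
            self.last_seen_path = current

    def __iter__(self):
        self._refresh()
        return iter(self._entries)

    def __len__(self):
        self._refresh()
        return len(self._entries)
```

With no __getitem__ or mutating methods defined, indexing or appending raises an exception instead of quietly doing the wrong thing.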
Alright... *if* we support automatic updates to virtual __path__ values,
this is probably how we should do it. (It will require, though, that
imp.find_module be changed to use a different iteration method than
PyList_GetItem, as it's quite possible a virtual __path__ will get
passed into it.)
Also, we *long* ago passed the point where any of this can be sanely
backported to Python 2.x with a simple shim, alas. For my purposes
at least, needing a full importlib for the implementation is a
no-go. :-( Still, for the future of Python, this all makes good
sense. I just wish we'd thought of all this in 2006 when the
discussion came up before: we maybe could've had this in Python
2.6. Where's that damn time machine when you *really* need it? ;-)