[Python-Dev] Draft PEP: "Simplified Package Layout and Partitioning"

Thu Jul 21 15:20:53 CEST 2011

At 11:52 AM 7/21/2011 +1000, Nick Coghlan wrote:
>Trying to change how packages are identified at the Python level makes
>PEP 382 sound positively appealing. __path__ needs to stay :)

In which case, it should be a list, not a sentinel.  ;-)

>Even better would be for these (and sys.path) to be list subclasses
>that did the right thing under the hood as Glenn suggested. Code that
>*replaces* rather than modifies these attributes would still
>potentially break virtual packages, but code that modifies them in
>place would do the right thing automatically. (Note that all code that
>manipulates sys.path and __path__ attributes requires explicit calls
>to correctly support current namespace package mechanisms, so this
>would actually be an improvement on the status quo rather than making
>anything worse).

I think the simplest thing, if we're keeping __path__ (and on 
reflection, I think we should), would be to simply call 
extend_virtual_paths() automatically on new path entries found in 
sys.path when an import is performed, relative to the previous value 
of sys.path.

That is, we save an "old" copy of sys.path somewhere, and whenever 
__import__() is called (well, once it gets past checking if the 
target is already in sys.modules, anyway), it checks the current 
sys.path against it, and calls extend_virtual_paths() on any sys.path 
entries that weren't in the "old" sys.path.

This is not the most efficient thing in the world, as it will cause a 
bunch of stat calls to happen against the new directories, in the 
middle of a possibly-entirely-unrelated import operation, but it 
would certainly address the issue in the Simplest Way That Could Possibly Work.
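
A minimal sketch of that bookkeeping, for illustration only.  The helper 
name sync_new_path_entries is made up here, and extend_virtual_paths is 
passed in as a callable because its real behavior is defined by the 
draft PEP, not by this snippet:

```python
def sync_new_path_entries(current_path, old_path, extend_virtual_paths):
    """Call extend_virtual_paths() on entries added since the last snapshot.

    current_path: the live sys.path (any sequence of entries)
    old_path:     the "old" copy saved after the previous import
    Returns a fresh copy of current_path to save as the next "old" path.
    """
    seen = list(old_path)  # list membership, so entries need not be hashable
    for entry in current_path:
        if entry not in seen:
            extend_virtual_paths(entry)
            seen.append(entry)  # don't process a duplicate entry twice
    return list(current_path)
```

An import hook would presumably call this right after the sys.modules 
check fails, and store the returned snapshot for the next call.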

A stricter (safer) version of the same thing would be one where we 
only update __path__ values that are unchanged since we created them, 
and rather than only appending new entries, we replace the __path__ 
with a newly-computed one.

This version is safer because it avoids corner cases like "I imported 
foo.bar while foo.baz 1.1 was on my path, then I prepended a 
directory to sys.path that has foo.baz 1.2, but I still get foo.baz 
1.1 when I import."  But it loses in cases where people do direct 
__path__ manipulation.
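
To make the ordering difference concrete, here is a rough sketch.  All 
three helper names are hypothetical, and the filesystem check for which 
subdirectories actually exist is deliberately left out:

```python
import os

def portions(subname, parent_path):
    """Candidate directories for subname, in parent_path order."""
    return [os.path.join(entry, subname) for entry in parent_path]

def append_new(path, subname, parent_path):
    """Loose version: keep existing entries, append any unseen ones."""
    return path + [p for p in portions(subname, parent_path) if p not in path]

def recompute(subname, parent_path):
    """Strict version: rebuild from scratch, so order follows parent_path."""
    return portions(subname, parent_path)

sys_path = ["/old"]                    # directory holding foo.baz 1.1
foo_path = portions("foo", sys_path)   # __path__ as first computed
sys_path.insert(0, "/new")             # prepend directory holding foo.baz 1.2

loose = append_new(foo_path, "foo", sys_path)
strict = recompute("foo", sys_path)
```

With the loose version the /old portion stays first, so foo.baz 1.1 
still wins; with the strict version the /new portion comes first, 
matching sys.path order.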

On the other hand, it's a lot easier to say "you break it, you bought 
it" where __path__ manipulation is concerned, so I'm actually pretty 
inclined towards using the strict version.

Hey...  here's a crazy idea.  Suppose that a virtual package __path__ 
is a *tuple* instead of a list?  Now, in order to change it, you 
*have* to replace it.  And we can cache the tuple we initially set it 
to in sys.virtual_package_paths, so we can do an 'is' check before 
replacing it.

Voila: __path__ still exists and is still a sequence for a virtual 
path, but you have to explicitly replace it if you want to do 
anything funky -- at which point you're responsible for maintaining it.

I'm tempted to say, "well, why not use a list-subclass proxy, then?", 
but that means more work for no real difference.  I just went through 
dozens of examples of __path__ usage (found via Google), and I found 
exactly two examples of code that modifies a __path__ without being either:

1. In the __init__.py whose __path__ it is (i.e., code that'll still 
have a list), or
2. Code modifying the __path__ of an explicitly-named self-contained 
package that's part of the same distribution.

The two examples are from Twisted, and Google AppEngine.  In the 
Twisted case, it's some sort of namespace package-like plugin 
chicanery, and in the AppEngine case, well, I'm not sure what the 
heck it's doing, but it seems to be making sure that you can still 
import stuff that has the same name as stdlib stuff, or something.

The Twisted case (and an apparent copy of the same code in a project 
called "flumotion") uses ihooks, though, so I'm not sure it'll even 
get executed for virtual packages.  The Google case loops over 
everything in sys.modules, in a function by the name of 
appengine.dist.fix_paths()...  but I wasn't able to find out who 
calls this function, when and why.

So, pretty much, except for these bits of "nosy" code, the vast 
majority of code out there seems to only mess with its own 
self-contained paths, making the use of tuples seem like a pretty safe choice.

(Oh, and all the code I found that reads paths without modifying them 
uses only tuple-safe operations.)

So, if we implement automatic __path__ updates for virtual packages, 
I'm currently leaning towards the strict approach using tuples, but 
could possibly be persuaded towards read-only list-proxies instead.

Side note: it looks like a *lot* of code out there abuses __path__[0] 
to find data files, so I probably need to add a note to the PEP about 
not doing that when you convert a self-contained package to a virtual 
one.  Of course, I suppose using a sentinel could address *that* 
problem, or an iteration-only proxy.

The main concern here is that using __path__[0] will *seem* to work 
when you first use it with a virtual package, because it'll be the 
right directory.  But it'll be wrong long-term.

This seems to lean in favor of making a simple reiterable wrapper 
type for the __path__, that only allows you to take the length and 
iterate over it.  With an appropriate design, it could actually 
update itself automatically, given a subname and a parent 
__path__/sys.path.  That is, it could keep a tuple copy of the 
last-seen parent path, and before iteration, compare 
tuple(self.parent_path) to self.last_seen_path.  If they're 
different, it rebuilds the value to be iterated over.

Voila: transparent updating of all virtual __path__ values from 
sys.path changes (or modifications to self-contained __path__ 
parents, btw), and trying to change it (or read an item from it 
positionally) will not create any silent failures.
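
Something along these lines, as a sketch under stated assumptions: the 
class name VirtualPath is made up, and the rule for which subdirectories 
count (here, a plain "is it a directory?" check, injectable for testing) 
is a simplification of whatever the PEP ends up specifying.

```python
import os

class VirtualPath:
    """Iteration-and-length-only view of a virtual package's path."""

    def __init__(self, subname, parent_path, isdir=os.path.isdir):
        self.subname = subname
        self.parent_path = parent_path   # live reference, e.g. sys.path
        self.isdir = isdir               # assumption: existence test only
        self.last_seen_path = None
        self._entries = ()

    def _refresh(self):
        # Rebuild only when the parent path has changed since last time.
        current = tuple(self.parent_path)
        if current != self.last_seen_path:
            candidates = (os.path.join(e, self.subname) for e in current)
            self._entries = tuple(c for c in candidates if self.isdir(c))
            self.last_seen_path = current

    def __iter__(self):
        self._refresh()
        return iter(self._entries)

    def __len__(self):
        self._refresh()
        return len(self._entries)
```

Because there is no __getitem__ or __setitem__, path[0] and item 
assignment raise TypeError instead of quietly returning a stale or 
misleading directory.  The parent can itself be a VirtualPath, since 
tuple() only needs iteration.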

Alright...  *if* we support automatic updates to virtual __path__ values, 
this is probably how we should do it.  (It will require, though, that 
imp.find_module be changed to use a different iteration method than 
PyList_GetItem, as it's quite possible a virtual __path__ will get 
passed into it.)

Also, we *long* ago passed the point where any of this can be sanely 
backported to Python 2.x with a simple shim, alas.  For my purposes 
at least, needing a full importlib for the implementation is a 
no-go.  :-(  Still, for the future of Python, this all makes good 
sense.  I just wish we'd thought of all this in 2006 when the 
discussion came up before: we maybe could've had this in Python 
2.6.  Where's that damn time machine when you *really* need it?  ;-)