PEP 402: Simplified Package Layout and Partitioning

Hi, I’ve read PEP 402 and would like to offer comments. I know a bit about the import system, but not down to the nitty-gritty details of PEP 302 and __path__ computations and all this fun stuff (by which I mean, not fun at all). As such, I can’t find nasty issues in dark corners, but I can offer feedback as a user. I think it’s a very well-written explanation of a very useful feature: +1 from me. If it is accepted, the docs will certainly be much more concise, but the PEP as a thought process is a useful document to read. One terminology nit: “packaging” usually refers to packaging/distribution/installation/deployment matters, not Python modules, so I suggest “Python package semantics”. I also don’t quite see why creating an __init__.py file is such a big step. Anyway, if the import-sig people say that users think it’s a complex or costly operation, I can believe it.
I wonder if importlib.import_module could implement the new import semantics all by itself, so that we can benefit from this PEP in older Pythons (importlib is on PyPI).
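(As a point of reference, importlib.import_module already has friendlier semantics than bare __import__, returning the leaf module rather than the top-level package:)

```python
import importlib

# __import__("json.decoder") would return the top-level json
# package by default; import_module returns the submodule itself.
decoder = importlib.import_module("json.decoder")
print(decoder.__name__)  # -> json.decoder
```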
Besides, putting data files in a Python package is viewed very poorly by some (mostly people following the Filesystem Hierarchy Standard), and in distutils2/packaging, we (will) have a resources system that’s as convenient for users and more flexible for OS packagers. Using __file__ for more than information on the module is frowned upon for other reasons anyway (I talked with a Debian developer about this one day but forget the details), so I think the limitation is okay.
Regards

On Aug 11, 2011, at 11:39 AM, Barry Warsaw wrote:
In some sense, I agree: hacks like empty strings are likely to lead to path-manipulation bugs where the wrong file gets opened (or worse, deleted, with predictable deleterious effects). But the whole "pure virtual" mechanism here seems to pile even more inconsistency on top of an already irritatingly inconsistent import mechanism.

I was reasonably happy with my attempt to paper over PEP 302's weirdnesses from a user perspective: http://twistedmatrix.com/documents/11.0.0/api/twisted.python.modules.html (or https://launchpad.net/modules if you are not a Twisted user) Users of this API can traverse the module hierarchy with certain expectations; each module or package would have .pathEntry and .filePath attributes, each of which would refer to the appropriate place. Of course __path__ complicates things a bit, but so it goes.

Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects. Rather than a one-by-one ad-hoc consideration of which attribute should be set to None or empty strings or "<string>" or what have you, I'd really like to see a discussion in the PEP saying what a package really is vs. what a module is, and what one can reasonably expect from it from an API and tooling perspective. Right now I have to puzzle out the intent of the final API from the problem/solution description and thought experiment.

Despite authoring several namespace packages myself, I don't have any of the problems described in the PEP. I just want to know how to write correct tools given this new specification. I suspect that this PEP will be the only reference for how packages work for a long time coming (just as PEP 302 was before it) so it should really get this right.

At 02:02 PM 8/11/2011 -0400, Glyph Lefkowitz wrote:
The assumption I've been working from is the only guarantee I've ever seen the Python docs give: i.e., that a package is a module object with a __path__ attribute. Modules aren't even required to have a __file__ object -- builtin modules don't, for example. (And the contents of __file__ are not required to have any particular semantics: PEP 302 notes that it can be a dummy value like "<frozen>", for example.) Technically, btw, PEP 302 requires __file__ to be a string, so making __file__ = None will be a backwards-incompatible change. But any code that walks modules in sys.modules is going to break today if it expects a __file__ attribute to exist, because 'sys' itself doesn't have one! So, my leaning is towards leaving off __file__, since today's code already has to deal with it being nonexistent, if it's working with arbitrary modules, and that'll produce breakage sooner rather than later -- the twisted.python.modules code, for example, would fail with a loud AttributeError, rather than going on to silently assume that a module with a dummy __file__ isn't a package. (Which is NOT a valid assumption *now*, btw, as I'll explain below.) Anyway, if you have any suggestions for verbiage that should be added to the PEP to clarify these assumptions, I'd be happy to add them. However, I think that the real problem you're encountering at the moment has more to do with making assumptions about the Python import ecosystem that aren't valid today, and haven't been valid since at least the introduction of PEP 302, if not earlier import hook systems as well.
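The sys example is easy to check, and it suggests the defensive pattern module-walking tools already need today (a sketch; module_file is a hypothetical helper, not a proposed API):

```python
import sys
import json

# Built-in modules such as sys carry no __file__ at all, so any
# code walking sys.modules must treat the attribute as optional.
assert not hasattr(sys, "__file__")

def module_file(mod):
    """Return the module's backing file, or None when there is none."""
    return getattr(mod, "__file__", None)

print(module_file(sys))   # -> None
print(module_file(json))  # a real path on disk
```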
I don't mean to be critical, and no doubt what you've written works fine for your current requirements, but on my quick attempt to skim through the code I found many things which appear to me to be incompatible with PEP 302. That is, the above code hardcodes a variety of assumptions about the import system that haven't been true since Python 2.3. (For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.) If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved. (Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.)
The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either. So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common.

On Aug 12, 2011, at 11:24 AM, P.J. Eby wrote:
That is, the above code hardcodes a variety of assumptions about the import system that haven't been true since Python 2.3.
Thanks for this feedback. I honestly did not realize how old and creaky this code had gotten. It was originally developed for Python 2.4 and it certainly shows its age. Practically speaking, the code is correct for the bundled importers, and paths and zipfiles are all we've cared about thus far.
(For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.)
Unfortunately, the primary goal of this code is to do something impossible - walk the module hierarchy without importing any code. So some heuristics are necessary. Upon further reflection, PEP 402 _will_ make dealing with namespace packages from this code considerably easier: we won't need to do AST analysis to look for a __path__ attribute or anything gross like that to improve correctness; we can just look in various directories on sys.path and accurately predict what __path__ will be synthesized to be. However, the isPackage() method can and should be looking at the module if it's already loaded, and not always guessing based on paths. The whole reason there's an 'importPackages' flag to walk() is that some applications of this code care more about accuracy than others, so it tries to be as correct as it can be. (Of course this is still wrong for the case where a __path__ is dynamically constructed by user code, but there's only so much one can do about that.)
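That prediction step can be sketched in a few lines; predicted_path below is a hypothetical helper applying PEP 402's directory-matching idea to a plain list of filesystem path entries (real importers need not be filesystem-based):

```python
import os
import tempfile

def predicted_path(name, search_path):
    """Predict the __path__ a virtual package would be given:
    every search-path entry containing a matching subdirectory
    contributes one entry, in order. Filesystem-only sketch."""
    return [os.path.join(parent, name)
            for parent in search_path
            if os.path.isdir(os.path.join(parent, name))]

# Demo: an 'ns' subdirectory exists under the first and third
# roots only, so the predicted __path__ has two entries.
roots = [tempfile.mkdtemp() for _ in range(3)]
os.mkdir(os.path.join(roots[0], "ns"))
os.mkdir(os.path.join(roots[2], "ns"))
prediction = predicted_path("ns", roots)
print(len(prediction))  # -> 2
```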
If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved.
This code still needs to support Python 2.4, but I will make a note of this for future reference.
(Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.)
Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects.
The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either.
The fact that getModule('sys') breaks is reason enough to re-visit some of these design decisions.
So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common.
In that case, I guess it's a good thing; these bugs should be dealt with. Thanks for pointing them out. My opinion of PEP 402 has been completely reversed - although I'd still like to see a section about the module system from a library/tools author point of view rather than a time-traveling perl user's narrative :).

At 01:09 PM 8/12/2011 -0400, Glyph Lefkowitz wrote:
The flip side of that is that you can't always know whether a directory is a virtual package without deep inspection: one consequence of PEP 402 is that any directory that contains a Python module (of whatever type), however deeply nested, will be a valid package name. So, you can't rule out that a given directory *might* be a package, without walking its entire reachable subtree. (Within the subset of directory names that are valid Python identifiers, of course.) However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or has the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such.

In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again! (E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!)

Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well.
This code still needs to support Python 2.4, but I will make a note of this for future reference.
A suggestion: just take the pkgutil code and bundle it for Python 2.4 as something._pkgutil. There's very little about it that's 2.5+ specific, at least when I wrote the bits that do the module walking. Of course, the main disadvantage of pkgutil for your purposes is that it currently requires packages to be imported in order to walk their child modules. (IIRC, it does *not*, however, require them to be imported in order to discover their existence.)
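For comparison, the pkgutil APIs in question look like this in use; iter_modules discovers children without importing them (walk_packages, by contrast, imports packages in order to read their __path__):

```python
import pkgutil
import logging

# List the submodules of the stdlib logging package without
# importing them; each item is a (finder, name, ispkg) triple.
names = sorted(info[1] for info in pkgutil.iter_modules(logging.__path__))
print(names)  # includes 'config' and 'handlers'
```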
LOL. If you will propose the wording you'd like to see, I'll be happy to check it for any current-and-or-future incorrect assumptions. ;-)

On Aug 12, 2011, at 2:33 PM, P.J. Eby wrote:
Are there any rules about passing invalid identifiers to __import__ though, or is that just less likely? :)
However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or is the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such.
I still like the idea of a 'marker' file. It would be great if there were a new marker like "__package__.py". I say this more for the benefit of users looking at a directory on their filesystem and trying to understand whether it is a package than I do for my own programmatic tools; it's already hard enough to understand the package-ness of a part of your filesystem and its interactions with PYTHONPATH, and making directories mysteriously and automatically become packages depending on context will worsen that situation, I think.

I also have this not-terribly-well-defined idea that it would be handy for different providers of the _contents_ of namespace packages to provide their own instrumentation, to be made aware that they've been added to the __path__ of a particular package. This may be a solution in search of a problem, but I imagine that each __package__.py would be executed in the same module namespace. This would allow namespace packages to do things like set up compatibility aliases, lazy imports, plugin registrations, etc., as they currently do with __init__.py. Perhaps it would be better to define its relationship to the package-module namespace in a more sensible way than "execute all over each other in no particular order".

Also, if I had my druthers, Python would raise an exception if someone added a directory marked as a package to sys.path, refuse to import things from it, and, when a submodule was run as a script, add the nearest directory not marked as a package to sys.path rather than the script's directory itself. The whole "__name__ is wrong because your current directory was wrong when you ran that command" thing is so confusing to explain that I hope we can eventually consign it to the dustbin of history. But if you can't even reasonably guess whether a directory is supposed to be an entry on sys.path or a package, that's going to be really hard to do.
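The "nearest directory not marked as a package" rule is easy to state in code for today's __init__-based markers; script_root is a hypothetical helper, not an existing API:

```python
import os
import tempfile

def script_root(script_path):
    """Walk upward from a script, skipping every directory marked
    as a package by an __init__.py; the first unmarked directory
    is the one that belongs on sys.path."""
    directory = os.path.dirname(os.path.abspath(script_path))
    while os.path.exists(os.path.join(directory, "__init__.py")):
        directory = os.path.dirname(directory)
    return directory

# Demo layout: root/pkg/__init__.py and root/pkg/mod.py; running
# root/pkg/mod.py as a script should put root (not pkg) on sys.path.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "pkg")
os.mkdir(pkg)
for filename in ("__init__.py", "mod.py"):
    open(os.path.join(pkg, filename), "w").close()

print(script_root(os.path.join(pkg, "mod.py")) == root)  # -> True
```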
In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again!
What do you mean "building of a virtual path"?
(E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!)
Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well.
The more that this can focus on module-walking without executing code, the happier I'll be :).
One of the stipulations of this code is that it might give different results when the modules are loaded and not. So it's fine to inspect that first and then invoke pkgutil only in the 'loaded' case, with the knowledge that the not-loaded case may be incorrect in the face of certain configurations.
If I can come up with anything I will definitely send it along. -glyph

At 05:03 PM 8/12/2011 -0400, Glyph Lefkowitz wrote:
Are there any rules about passing invalid identifiers to __import__ though, or is that just less likely? :)
I suppose you have a point there. ;-)
I still like the idea of a 'marker' file. It would be great if there were a new marker like "__package__.py".
Having any required marker file makes separately-installable portions of a package impossible, since it would then be in conflict at installation time. The (semi-)competing proposal, PEP 382, is based on allowing each portion to have a differently-named marker; we came up with PEP 402 as a way to get rid of the need for any marker files (not to mention the bikeshedding involved.)
What do you mean "building of a virtual path"?
Constructing the __path__-to-be of a not-yet-imported virtual package. The PEP defines a protocol for constructing this, by asking the importer objects to provide __path__ entries, and it does not require anything to be imported. So there's no reason to re-implement the algorithm yourself.
The more that this can focus on module-walking without executing code, the happier I'll be :).
Virtual packages actually improve on this situation, in that a virtual path can be computed without the need to import the package. (Assuming a submodule or subpackage doesn't munge the __path__, of course.)

On Thu, 11 Aug 2011 11:39:52 -0400 Barry Warsaw <barry@python.org> wrote:
None should be the answer. It simplifies inspection of module data (repr(__file__) gives you something recognizable instead of raising) and makes sense semantically, since there is, indeed, no actual file backing the module. Regards, Antoine.

At 04:39 PM 8/11/2011 +0200, Éric Araujo wrote:
Hi,
I've read PEP 402 and would like to offer comments.
Thanks.
Changing to "Python package import semantics" to hopefully be even clearer. ;-) (Nitpick: I was somewhat intentionally ambiguous because we are talking here about how a package is physically implemented in the filesystem, and that actually *is* kind of a packaging issue. But it's not necessarily a *useful* intentional ambiguity, so I've no problem with removing it.)
It's not that it's complex or costly in anything other than *mental* overhead -- you have to remember to do it and it's not particularly obvious. (But people on import-sig did mention this and other things covered by the PEP as being a frequent root cause of beginner inquiries on #python, Stackoverflow, et al.)
Since each package's __path__ is the same length or shorter than its parent's by default, if you put a virtual package inside a self-contained one, it will be functionally no different from a self-contained one, in that it will have only one path entry. So it's not really useful to put a virtual package inside a self-contained one, even though you can do it. (Apart from it letting you avoid a superfluous __init__ module, assuming it's indeed superfluous.)
It *is* possible - you'd just have to put it in a "zc.py" file. IOW, this PEP still allows "namespace-defining packages" to exist, as was requested by early commenters on PEP 382. It just doesn't *require* them to exist in order for the namespace contents to be importable.
Well, I rather *like* having them there, personally, vs. having to learn yet another API, but oh well, whatever. AFAIK, ImportEngine isn't going to do away with the need for the global ones to live somewhere, at least not in 3.3.
As written in the current proposal, yes. There was some discussion on Python-Dev about having this happen automatically, and I proposed that it could be done by making virtual packages' __path__ attributes an iterable proxy object, rather than a list: http://mail.python.org/pipermail/python-dev/2011-July/112429.html (This is an open option that hasn't been added to the PEP as yet, because I wanted to know Guido's thoughts on the proposal as it stands before burdening it with more implementation detail for a feature (automatic updates) that he might not be very keen on to begin with, even if it does make the semantics that much more familiar for Perl or PHP users.)
Is it a useful thing? Dunno. That's why it's open for comment. If the auto-update approach is used, then the __path__ of virtual packages will have a distinguishable type(). My plan was just to create a specific pep382 module to include with future versions of setuptools, but as things worked out, I'm not sure if that'll be sanely doable for pep402.
Not so long as you passed it a package name instead of a module name. This issue exists today with namespace packages; it's not new to virtual packages.
Are those same people similarly concerned when a Firefox extension contains image files as well as JavaScript? And if not, why is Python different? IOW, I think that those people are being confused by our use of the term "data" and thus think of it as an entirely different sort of "data" than what is meant by "package data" in the Python world. I am not sure what word would unconfuse (defuse?) them, but we simply mean "files that are part of the package but are not of a type that Python can import by default," not "user-modifiable data" or "data that has meaning or usefulness to code other than the code it was packaged with." Perhaps "package-embedded resources" would be a better phrase? Certainly, it implies that they're *supposed* to be embedded there. ;-)
Done.

On Fri, Aug 12, 2011 at 4:30 AM, P.J. Eby <pje@telecommunity.com> wrote:
And likely not for the entire 3.x series - I shudder at the thought of the backwards incompatibility hell associated with trying to remove them... The point of the ImportEngine API is that the caching elements of the import state introduce cross dependencies between various global data structures. Code that manipulates those data structures needs to correctly invalidate or otherwise update the state as things change. I seem to recall a certain programming construct that is designed to make it easier to manage interdependent data structures... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Going through my email backlog.
I still don’t understand why this matters or what negative effects it could have on code, but I’m fine with not understanding. I’ll trust that people writing or maintaining import-related tools will agree or complain about that item.
That’s quite cool. I guess such a namespace-defining module (zc.py here) would be importable, right? Also, would it cause worse performance for other zc.* packages than if there were no zc.py?
Agreed with “whatever” :) I just like to grunt sometimes.
AFAIK, ImportEngine isn't going to do away with the need for the global ones to live somewhere,
Yep, but as Nick replied, at least we’ll gain one structure to rule them all.
That sounds a bit too complicated. What about just having pkgutil.extend_virtual_paths call sys.path.append? For maximum flexibility, extend_virtual_paths could have an argument to avoid calling sys.path.append.
A good example is documentation: Having a unique location (/usr/share/doc) for all installed software makes my life easier. Another example is JavaScript files used with HTML documents, such as jQuery: Debian recently split the jQuery file out of their Sphinx package, so that there is only one library installed that all packages can use and that can be updated and fixed once for all. (I’m simplifying; there can be multiple versions of libraries, but not multiple copies. I’ll stop here; I’m not one of the authors of the Filesystem Hierarchy Standard, and I’ll rant against package_data in distutils mailing lists :)
A pure virtual package having no source file, I think it should have no __file__ at all.
Antoine and someone else thought likewise (I can find the link if you want); do you consider it consensus enough to update the PEP? Regards

On Sat, Nov 26, 2011 at 11:53 AM, Éric Araujo <merwok@netwok.org> wrote:
Yes.
Also, would it cause worse performance for other zc.* packages than if there were no zc.py?
No. The first import of a subpackage sets up the __path__, and all subsequent imports use it.
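(The namespace-package support that eventually landed in Python 3.3 behaves the same way, which makes the one-time __path__ setup easy to demonstrate; zc_demo below is a made-up name, and the example assumes Python 3.3+:)

```python
import os
import sys
import tempfile

# Two installation roots, each contributing one portion of the
# same namespace; neither root contains a zc_demo/__init__.py.
roots = [tempfile.mkdtemp(), tempfile.mkdtemp()]
for root, portion in zip(roots, ("buildout", "recipe")):
    portion_dir = os.path.join(root, "zc_demo", portion)
    os.makedirs(portion_dir)
    open(os.path.join(portion_dir, "__init__.py"), "w").close()

sys.path[:0] = roots
import zc_demo  # first import computes the combined __path__

# Both roots contribute one entry; later imports of
# zc_demo.buildout / zc_demo.recipe simply reuse this path.
print(len(list(zc_demo.__path__)))  # -> 2
```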
A pure virtual package having no source file, I think it should have no __file__ at all.
Sure. At this point, though, before doing any more work on the PEP I'd like to have some idea of whether there's any chance of it being accepted. At this point, there seems to be a lot of passive, "Usenet nod syndrome" type support for it, but little active support. It doesn't help at all that I'm not really in a position to provide an implementation, and the persons most likely to implement have been leaning somewhat towards 382, or wanting to modify 402 such that it uses .pyp directory extensions so that PEP 395 can be supported... And while 402 is an extension of an idea that Guido proposed a few years ago, he hasn't weighed in lately on whether he still likes that idea, let alone whether he likes where I've taken it. ;-)

If this helps, I am +1, and I’m sure other devs will chime in. I think the feature is useful, and I prefer 402’s way to 382’s pyp directories.
If that's the obstacle to adopting PEP 382, it would be easy to revert the PEP back to having file markers to indicate package-ness. I insist on having markers of some kind, though (IIUC, this is also what PEP 395 requires). The main problem with file markers is that a) they must not overlap across portions of a package, and b) the actual file name and content is irrelevant. a) means that package authors have to come up with some name, and b) means that the name actually doesn't matter (but the file name extension would). UUIDs would work, as would the name of the portion/distribution. I think the specific choice of name will confuse people into interpreting things in the file name that aren't really intended. Regards, Martin

On Thu, Dec 1, 2011 at 1:28 AM, PJ Eby <pje@telecommunity.com> wrote:
While I was initially a fan of the possibilities of PEP 402, I eventually decided that we would be trading an easy problem ("you need an '__init__.py' marker file or a '.pyp' extension to get Python to recognise your package directory") for a hard one ("What's your sys.path look like? What did you mean for it to look like?"). Symlinks (and the fact we implicitly call realpath() during system initialisation and import) just make things even messier. *Deliberately* allowing package structures on the filesystem to become ambiguous is a recipe for future pain (and could potentially undo a lot of the good work done by PEP 328's elimination of implicit relative imports).

I acknowledge there is a lot of confusion amongst novices as to how packages and imports actually work, but my diagnosis of the root cause of that problem is completely different from that supposed by PEP 402 (as documented in the more recent versions of PEP 395, I've come to believe it is due to the way we stuff up the default sys.path[0] initialisation when packages are involved).

So, in the end, I've come to strongly prefer the PEP 382 approach. The principle of "Explicit is better than implicit" applies to package detection on the filesystem just as much as it does to any other kind of API design, and it really isn't that different from the way we treat actual Python files (i.e. you can *execute* arbitrary files, but they need to have an appropriate extension if you want to import them). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Nov 30, 2011, at 6:39 PM, Nick Coghlan wrote:
I've helped an almost distressing number of newbies overcome their confusion about sys.path and packages. Systems using Twisted are, almost by definition, hairy integration problems, and are frequently being created or maintained by people with little to no previous Python experience. Given that experience, I completely agree with everything you've written above (except for the part where you initially liked it).

I appreciate the insight that PEP 402 offers about Python's package mechanism (and the difficulties introduced by namespace packages). Its statement of the problem is good, but in my opinion its solution points in exactly the wrong direction: packages need to be _more_ explicit about their package-ness, and tools need to be stricter about how they're laid out. It would be great if sys.path[0] were actually correct when running a script inside a package, or at least issued a warning explaining how to correctly lay out said package. I would love to see a loud alarm every time a module accidentally got imported by the same name twice. I wish I knew, once and for all, whether it was 'import Image' or 'from PIL import Image'.

My hope is that if Python starts to tighten these things up a bit, or at least communicate better about best practices, editors and IDEs will develop better automatic discovery features, and frameworks will start to normalize their sys.path setups and stop depending on accidents of current directory and script location. This will in turn vastly decrease confusion among new Python developers taking on large projects with a bunch of libraries, who mostly don't care what the rules for where files are supposed to go are, and just want to put them somewhere that works. -glyph

Éric Araujo <merwok <at> netwok.org> writes:
The FHS does not apply in all scenarios - not all Python code is deployed/packaged at system level. For example, plug-ins (such as Django apps) are often not meant to be installed by a system-level packager. This might also be true in scenarios where Python is embedded into some other application. It's really useful to be able to co-locate packages with their data (e.g. in a zip file) and I don't think all instances of putting data files in a package are to be frowned upon. Regards, Vinay Sajip

On Aug 11, 2011, at 11:39 AM, Barry Warsaw wrote:
In some sense, I agree: hacks like empty strings are likely to lead to path-manipulation bugs where the wrong file gets opened (or worse, deleted, with predictable deleterious effects). But the whole "pure virtual" mechanism here seems to pile even more inconsistency on top of an already irritatingly inconsistent import mechanism. I was reasonably happy with my attempt to paper over PEP 302's weirdnesses from a user perspective: http://twistedmatrix.com/documents/11.0.0/api/twisted.python.modules.html (or https://launchpad.net/modules if you are not a Twisted user) Users of this API can traverse the module hierarchy with certain expectations; each module or package would have .pathEntry and .filePath attributes, each of which would refer to the appropriate place. Of course __path__ complicates things a bit, but so it goes. Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects. Rather than a one-by-one ad-hoc consideration of which attribute should be set to None or empty strings or "<string>" or what have you, I'd really like to see a discussion in the PEP saying what a package really is vs. what a module is, and what one can reasonably expect from it from an API and tooling perspective. Right now I have to puzzle out the intent of the final API from the problem/solution description and thought experiment. Despite authoring several namespace packages myself, I don't have any of the problems described in the PEP. I just want to know how to write correct tools given this new specification. I suspect that this PEP will be the only reference for how packages work for a long time coming (just as PEP 302 was before it) so it should really get this right.

At 02:02 PM 8/11/2011 -0400, Glyph Lefkowitz wrote:
The assumption I've been working from is the only guarantee I've ever seen the Python docs give: i.e., that a package is a module object with a __path__ attribute. Modules aren't even required to have a __file__ object -- builtin modules don't, for example. (And the contents of __file__ are not required to have any particular semantics: PEP 302 notes that it can be a dummy value like "<frozen>", for example.) Technically, btw, PEP 302 requires __file__ to be a string, so making __file__ = None will be a backwards-incompatible change. But any code that walks modules in sys.modules is going to break today if it expects a __file__ attribute to exist, because 'sys' itself doesn't have one! So, my leaning is towards leaving off __file__, since today's code already has to deal with it being nonexistent, if it's working with arbitrary modules, and that'll produce breakage sooner rather than later -- the twisted.python.modules code, for example, would fail with a loud AttributeError, rather than going on to silently assume that a module with a dummy __file__ isn't a package. (Which is NOT a valid assumption *now*, btw, as I'll explain below.) Anyway, if you have any suggestions for verbiage that should be added to the PEP to clarify these assumptions, I'd be happy to add them. However, I think that the real problem you're encountering at the moment has more to do with making assumptions about the Python import ecosystem that aren't valid today, and haven't been valid since at least the introduction of PEP 302, if not earlier import hook systems as well.
I don't mean to be critical, and no doubt what you've written works fine for your current requirements, but on my quick attempt to skim through the code I found many things which appear to me to be incompatible with PEP 302. That is, the above code hardcodes a variety of assumptions about the import system that haven't been true since Python 2.3. (For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.)

If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved. (Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.)
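[Editor's note: the delegation PJE describes is visible in pkgutil itself; a small sketch using the modern stdlib, which asks each sys.path entry's finder to enumerate modules instead of hardcoding filesystem or zipfile semantics.]

```python
import pkgutil

# Each entry of sys.path is handled by its own finder object; we never
# inspect the path strings ourselves.
modules = {}
for finder, name, ispkg in pkgutil.iter_modules():
    modules[name] = ispkg

# The stdlib's json is discoverable, and its finder reports it as a package.
assert "json" in modules
assert modules["json"]
```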
The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either. So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common.

On Aug 12, 2011, at 11:24 AM, P.J. Eby wrote:
That is, the above code hardcodes a variety of assumptions about the import system that haven't been true since Python 2.3.
Thanks for this feedback. I honestly did not realize how old and creaky this code had gotten. It was originally developed for Python 2.4 and it certainly shows its age. Practically speaking, the code is correct for the bundled importers, and paths and zipfiles are all we've cared about thus far.
(For example, it assumes that the contents of sys.path strings have inspectable semantics, that the contents of __file__ can tell you things about the module-ness or package-ness of a module object, etc.)
Unfortunately, the primary goal of this code is to do something impossible - walk the module hierarchy without importing any code. So some heuristics are necessary.

Upon further reflection, PEP 402 _will_ make dealing with namespace packages from this code considerably easier: we won't need to do AST analysis to look for a __path__ attribute or anything gross like that to improve correctness; we can just look in various directories on sys.path and accurately predict what __path__ will be synthesized to be.

However, the isPackage() method can and should be looking at the module if it's already loaded, and not always guessing based on paths. The whole reason there's an 'importPackages' flag to walk() is that some applications of this code care more about accuracy than others, so it tries to be as correct as it can be. (Of course this is still wrong for the case where a __path__ is dynamically constructed by user code, but there's only so well one can do at that.)
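[Editor's note: the "look at the live module first, only then guess" policy can be sketched like this. It uses importlib.util.find_spec, which post-dates this thread, purely for illustration; `is_package` is a hypothetical helper, not part of twisted.python.modules.]

```python
import importlib.util
import sys

def is_package(modname):
    """Prefer inspecting an already-loaded module; otherwise ask the
    import system, which can answer without executing package code."""
    mod = sys.modules.get(modname)
    if mod is not None:
        return hasattr(mod, "__path__")
    spec = importlib.util.find_spec(modname)
    # Packages (regular and namespace) carry submodule search locations.
    return spec is not None and spec.submodule_search_locations is not None
```

As PJE notes, any such answer is still provisional for packages whose code munges __path__ at import time.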
If you want to fully support PEP 302, you might want to consider making this a wrapper over the corresponding pkgutil APIs (available since Python 2.5) that do roughly the same things, but which delegate all path string inspection to importer objects and allow extensible delegation for importers that don't support the optional methods involved.
This code still needs to support Python 2.4, but I will make a note of this for future reference.
(Of course, if the pkgutil APIs are missing something you need, perhaps you could propose additions.)
Now it seems like pure virtual packages are going to introduce a new type of special case into the hierarchy which have neither .pathEntry nor .filePath objects.
The problem is that your API's notion that these things exist as coherent concepts was never really a valid assumption in the first place. .pth files and namespace packages already meant that the idea of a package coming from a single path entry made no sense. And namespace packages installed by setuptools' system packaging mode *don't have a __file__ attribute* today... heck they don't have __init__ modules, either.
The fact that getModule('sys') breaks is reason enough to re-visit some of these design decisions.
So, adding virtual packages isn't actually going to change anything, except perhaps by making these scenarios more common.
In that case, I guess it's a good thing; these bugs should be dealt with. Thanks for pointing them out. My opinion of PEP 402 has been completely reversed - although I'd still like to see a section about the module system from a library/tools author point of view rather than a time-traveling perl user's narrative :).

At 01:09 PM 8/12/2011 -0400, Glyph Lefkowitz wrote:
The flip side of that is that you can't always know whether a directory is a virtual package without deep inspection: one consequence of PEP 402 is that any directory that contains a Python module (of whatever type), however deeply nested, will be a valid package name. So, you can't rule out that a given directory *might* be a package, without walking its entire reachable subtree. (Within the subset of directory names that are valid Python identifiers, of course.)

However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or has the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such.

In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again! (E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!)

Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well.
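[Editor's note: a cheap version of the "might be a package" test described above. This is a heuristic sketch only; `might_be_package` and the names used in it are invented for the example.]

```python
import os

def might_be_package(dirpath):
    """Flag a directory as a *probable* virtual package if its name is a
    valid identifier and it directly contains importable-looking entries.
    Deliberately does NOT walk the whole subtree, so it can miss deeply
    nested modules -- per the thread, only walking can rule a directory
    out for certain."""
    if not os.path.basename(dirpath).isidentifier():
        return False
    try:
        entries = os.listdir(dirpath)
    except OSError:
        return False
    return any(e.endswith(".py") or os.path.isdir(os.path.join(dirpath, e))
               for e in entries)
```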
This code still needs to support Python 2.4, but I will make a note of this for future reference.
A suggestion: just take the pkgutil code and bundle it for Python 2.4 as something._pkgutil. There's very little about it that's 2.5+ specific, at least when I wrote the bits that do the module walking. Of course, the main disadvantage of pkgutil for your purposes is that it currently requires packages to be imported in order to walk their child modules. (IIRC, it does *not*, however, require them to be imported in order to discover their existence.)
LOL. If you will propose the wording you'd like to see, I'll be happy to check it for any current-and-or-future incorrect assumptions. ;-)

On Aug 12, 2011, at 2:33 PM, P.J. Eby wrote:
Are there any rules about passing invalid identifiers to __import__ though, or is that just less likely? :)
However, you *can* quickly tell that a directory *might* be a package or is *probably* one: if it contains modules, or has the same name as an already-discovered module, it's a pretty safe bet that you can flag it as such.
I still like the idea of a 'marker' file. It would be great if there were a new marker like "__package__.py". I say this more for the benefit of users looking at a directory on their filesystem and trying to understand whether this is a package or not than I do for my own programmatic tools though; it's already hard enough to understand the package-ness of a part of your filesystem and its interactions with PYTHONPATH; making directories mysteriously and automatically become packages depending on context will worsen that situation, I think.

I also have this not-terribly-well-defined idea that it would be handy for different providers of the _contents_ of namespace packages to provide their own instrumentation to be made aware that they've been added to the __path__ of a particular package. This may be a solution in search of a problem, but I imagine that each __package__.py would be executed in the same module namespace. This would allow namespace packages to do things like set up compatibility aliases, lazy imports, plugin registrations, etc, as they currently do with __init__.py. Perhaps it would be better to define its relationship to the package-module namespace in a more sensible way than "execute all over each other in no particular order".

Also, if I had my druthers, Python would raise an exception if someone added a directory marked as a package to sys.path, to refuse to import things from it, and when a submodule was run as a script, add the nearest directory not marked as a package to sys.path, rather than the script's directory itself. The whole "__name__ is wrong because your current directory was wrong when you ran that command" thing is so confusing to explain that I hope we can eventually consign it to the dustbin of history. But if you can't even reasonably guess whether a directory is supposed to be an entry on sys.path or a package, that's going to be really hard to do.
In any case, you probably should *not* do the building of a virtual path yourself; the protocols and APIs added by PEP 402 should allow you to simply ask for the path to be constructed on your behalf. Otherwise, you are going to be back in the same business of second-guessing arbitrary importer backends again!
What do you mean "building of a virtual path"?
(E.g. note that PEP 402 does not say virtual package subpaths must be filesystem or zipfile subdirectories of their parents - an importer could just as easily allow you to treat subdirectories named 'twisted.python' as part of a virtual package with that name!)
Anyway, pkgutil defines some extra methods that importers can implement to support module-walking, and part of the PEP 402 implementation should be to make this support virtual packages as well.
The more that this can focus on module-walking without executing code, the happier I'll be :).
One of the stipulations of this code is that it might give different results when the modules are loaded and not. So it's fine to inspect that first and then invoke pkgutil only in the 'loaded' case, with the knowledge that the not-loaded case may be incorrect in the face of certain configurations.
If I can come up with anything I will definitely send it along. -glyph

At 05:03 PM 8/12/2011 -0400, Glyph Lefkowitz wrote:
Are there any rules about passing invalid identifiers to __import__ though, or is that just less likely? :)
I suppose you have a point there. ;-)
I still like the idea of a 'marker' file. It would be great if there were a new marker like "__package__.py".
Having any required marker file makes separately-installable portions of a package impossible, since it would then be in conflict at installation time. The (semi-)competing proposal, PEP 382, is based on allowing each portion to have a differently-named marker; we came up with PEP 402 as a way to get rid of the need for any marker files (not to mention the bikeshedding involved.)
What do you mean "building of a virtual path"?
Constructing the __path__-to-be of a not-yet-imported virtual package. The PEP defines a protocol for constructing this, by asking the importer objects to provide __path__ entries, and it does not require anything to be imported. So there's no reason to re-implement the algorithm yourself.
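[Editor's note: PJE's warning about not rebuilding this yourself aside, the shape of the computation is easy to show. This is a toy sketch for the plain-filesystem case only; `virtual_path` is a hypothetical helper, not the PEP's actual API, and real importers could contribute entries that aren't filesystem subdirectories at all.]

```python
import os
import sys

def virtual_path(name, search_path=None):
    """Toy version: collect the matching subdirectories of each parent
    path entry. The real PEP 402 protocol asks each path entry's
    importer for its contribution instead of assuming directories."""
    entries = sys.path if search_path is None else search_path
    return [os.path.join(entry, name)
            for entry in entries
            if os.path.isdir(os.path.join(entry, name))]
```

Note that nothing here imports the package, which is the point PJE makes below about computing a virtual path without executing code.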
The more that this can focus on module-walking without executing code, the happier I'll be :).
Virtual packages actually improve on this situation, in that a virtual path can be computed without the need to import the package. (Assuming a submodule or subpackage doesn't munge the __path__, of course.)

On Thu, 11 Aug 2011 11:39:52 -0400 Barry Warsaw <barry@python.org> wrote:
None should be the answer. It simplifies inspection of module data (repr(__file__) gives you something recognizable instead of raising) and makes sense semantically (!) since there is, indeed, no actual file backing the module.

Regards,
Antoine.

At 04:39 PM 8/11/2011 +0200, Éric Araujo wrote:
Hi,
I've read PEP 402 and would like to offer comments.
Thanks.
Changing to "Python package import semantics" to hopefully be even clearer. ;-) (Nitpick: I was somewhat intentionally ambiguous because we are talking here about how a package is physically implemented in the filesystem, and that actually *is* kind of a packaging issue. But it's not necessarily a *useful* intentional ambiguity, so I've no problem with removing it.)
It's not that it's complex or costly in anything other than *mental* overhead -- you have to remember to do it and it's not particularly obvious. (But people on import-sig did mention this and other things covered by the PEP as being a frequent root cause of beginner inquiries on #python, Stackoverflow, et al.)
Since each package's __path__ is the same length or shorter than its parent's by default, if you put a virtual package inside a self-contained one, it will be functionally speaking no different than a self-contained one, in that it will have only one path entry. So, it's not really useful to put a virtual package inside a self-contained one, even though you can do it. (Apart from it letting you avoid a superfluous __init__ module, assuming it's indeed superfluous.)
It *is* possible - you'd just have to put it in a "zc.py" file. IOW, this PEP still allows "namespace-defining packages" to exist, as was requested by early commenters on PEP 382. It just doesn't *require* them to exist in order for the namespace contents to be importable.
Well, I rather *like* having them there, personally, vs. having to learn yet another API, but oh well, whatever. AFAIK, ImportEngine isn't going to do away with the need for the global ones to live somewhere, at least not in 3.3.
As written in the current proposal, yes. There was some discussion on Python-Dev about having this happen automatically, and I proposed that it could be done by making virtual packages' __path__ attributes an iterable proxy object, rather than a list: http://mail.python.org/pipermail/python-dev/2011-July/112429.html (This is an open option that hasn't been added to the PEP as yet, because I wanted to know Guido's thoughts on the proposal as it stands before burdening it with more implementation detail for a feature (automatic updates) that he might not be very keen on to begin with, even it does make the semantics that much more familiar for Perl or PHP users.)
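[Editor's note: the "iterable proxy" idea could look something like this. This is a hypothetical sketch of the mailing-list proposal, filesystem-only for brevity; `VirtualPathProxy` is an invented name.]

```python
import os
import sys

class VirtualPathProxy:
    """Recompute a virtual package's __path__ entries from sys.path on
    every iteration, so later sys.path changes are picked up without an
    explicit pkgutil.extend_virtual_paths() call."""
    def __init__(self, name):
        self._name = name
    def __iter__(self):
        for entry in sys.path:
            candidate = os.path.join(entry, self._name)
            if os.path.isdir(candidate):
                yield candidate
```

A real implementation would delegate to each path entry's importer rather than assuming directories, which is exactly the extra implementation detail PJE alludes to below.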
Is it a useful thing? Dunno. That's why it's open for comment. If the auto-update approach is used, then the __path__ of virtual packages will have a distinguishable type(). My plan was just to create a specific pep382 module to include with future versions of setuptools, but as things worked out, I'm not sure if that'll be sanely doable for pep402.
Not so long as you passed it a package name instead of a module name. This issue exists today with namespace packages; it's not new to virtual packages.
Are those same people similarly concerned when a Firefox extension contains image files as well as JavaScript? And if not, why is Python different? IOW, I think that those people are being confused by our use of the term "data" and thus think of it as an entirely different sort of "data" than what is meant by "package data" in the Python world. I am not sure what word would unconfuse (defuse?) them, but we simply mean "files that are part of the package but are not of a type that Python can import by default," not "user-modifiable data" or "data that has meaning or usefulness to code other than the code it was packaged with." Perhaps "package-embedded resources" would be a better phrase? Certainly, it implies that they're *supposed* to be embedded there. ;-)
Done.

On Fri, Aug 12, 2011 at 4:30 AM, P.J. Eby <pje@telecommunity.com> wrote:
And likely not for the entire 3.x series - I shudder at the thought of the backwards incompatibility hell associated with trying to remove them... The point of the ImportEngine API is that the caching elements of the import state introduce cross dependencies between various global data structures. Code that manipulates those data structures needs to correctly invalidate or otherwise update the state as things change. I seem to recall a certain programming construct that is designed to make it easier to manage interdependent data structures... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Going through my email backlog.
I still don’t understand why this matters or what negative effects it could have on code, but I’m fine with not understanding. I’ll trust that people writing or maintaining import-related tools will agree or complain about that item.
That’s quite cool. I guess such a namespace-defining module (zc.py here) would be importable, right? Also, would it cause worse performance for other zc.* packages than if there were no zc.py?
Agreed with “whatever” :) I just like to grunt sometimes.
AFAIK, ImportEngine isn't going to do away with the need for the global ones to live somewhere,
Yep, but as Nick replied, at least we’ll gain one structure to rule them all.
That sounds a bit too complicated. What about just having pkgutil.extend_virtual_paths call sys.path.append? For maximum flexibility, extend_virtual_paths could have an argument to avoid calling sys.path.append.
A good example is documentation: Having a unique location (/usr/share/doc) for all installed software makes my life easier. Another example is JavaScript files used with HTML documents, such as jQuery: Debian recently split the jQuery file out of their Sphinx package, so that there is only one library installed that all packages can use and that can be updated and fixed once for all. (I’m simplifying; there can be multiple versions of libraries, but not multiple copies. I’ll stop here; I’m not one of the authors of the Filesystem Hierarchy Standard, and I’ll rant against package_data in distutils mailing lists :)
A pure virtual package having no source file, I think it should have no __file__ at all.
Antoine and someone else thought likewise (I can find the link if you want); do you consider it consensus enough to update the PEP? Regards

On Sat, Nov 26, 2011 at 11:53 AM, Éric Araujo <merwok@netwok.org> wrote:
Yes.
Also, would it cause worse performance for other zc.* packages than if there were no zc.py?
No. The first import of a subpackage sets up the __path__, and all subsequent imports use it.
A pure virtual package having no source file, I think it should have no __file__ at all.
Sure. At this point, though, before doing any more work on the PEP I'd like to have some idea of whether there's any chance of it being accepted. At this point, there seems to be a lot of passive, "Usenet nod syndrome" type support for it, but little active support. It doesn't help at all that I'm not really in a position to provide an implementation, and the persons most likely to implement have been leaning somewhat towards 382, or wanting to modify 402 such that it uses .pyp directory extensions so that PEP 395 can be supported... And while 402 is an extension of an idea that Guido proposed a few years ago, he hasn't weighed in lately on whether he still likes that idea, let alone whether he likes where I've taken it. ;-)

If this helps, I am +1, and I’m sure other devs will chime in. I think the feature is useful, and I prefer 402’s way to 382’s pyp directories.
If that's the obstacle to adopting PEP 382, it would be easy to revert the PEP back to having file markers to indicate package-ness. I insist on having markers of some kind, though (IIUC, this is also what PEP 395 requires).

The main problem with file markers is that a) they must not overlap across portions of a package, and b) the actual file name and content is irrelevant. a) means that package authors have to come up with some name, and b) means that the name actually doesn't matter (but the file name extension would). UUIDs would work, as would the name of the portion/distribution. I think the specific choice of name will confuse people into interpreting things in the file name that aren't really intended.

Regards,
Martin

On Thu, Dec 1, 2011 at 1:28 AM, PJ Eby <pje@telecommunity.com> wrote:
While I was initially a fan of the possibilities of PEP 402, I eventually decided that we would be trading an easy problem ("you need an '__init__.py' marker file or a '.pyp' extension to get Python to recognise your package directory") for a hard one ("What's your sys.path look like? What did you mean for it to look like?"). Symlinks (and the fact we implicitly call realpath() during system initialisation and import) just make things even messier.

*Deliberately* allowing package structures on the filesystem to become ambiguous is a recipe for future pain (and could potentially undo a lot of the good work done by PEP 328's elimination of implicit relative imports). I acknowledge there is a lot of confusion amongst novices as to how packages and imports actually work, but my diagnosis of the root cause of that problem is completely different from that supposed by PEP 402 (as documented in the more recent versions of PEP 395, I've come to believe it is due to the way we stuff up the default sys.path[0] initialisation when packages are involved).

So, in the end, I've come to strongly prefer the PEP 382 approach. The principle of "Explicit is better than implicit" applies to package detection on the filesystem just as much as it does to any other kind of API design, and it really isn't that different from the way we treat actual Python files (i.e. you can *execute* arbitrary files, but they need to have an appropriate extension if you want to import them).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Nov 30, 2011, at 6:39 PM, Nick Coghlan wrote:
I've helped an almost distressing number of newbies overcome their confusion about sys.path and packages. Systems using Twisted are, almost by definition, hairy integration problems, and are frequently being created or maintained by people with little to no previous Python experience. Given that experience, I completely agree with everything you've written above (except for the part where you initially liked it).

I appreciate the insight that PEP 402 offers about Python's package mechanism (and the difficulties introduced by namespace packages). Its statement of the problem is good, but in my opinion its solution points in exactly the wrong direction: packages need to be _more_ explicit about their package-ness and tools need to be stricter about how they're laid out.

It would be great if sys.path[0] were actually correct when running a script inside a package, or at least issued a warning which would explain how to correctly lay out said package. I would love to see a loud alarm every time a module accidentally got imported by the same name twice. I wish I knew, once and for all, whether it was 'import Image' or 'from PIL import Image'.

My hope is that if Python starts to tighten these things up a bit, or at least communicate better about best practices, editors and IDEs will develop better automatic discovery features and frameworks will start to normalize their sys.path setups and stop depending on accidents of current directory and script location. This will in turn vastly decrease confusion among new Python developers taking on large projects with a bunch of libraries, who mostly don't care what the rules for where files are supposed to go are, and just want to put them somewhere that works.

-glyph

Éric Araujo <merwok <at> netwok.org> writes:
The FHS does not apply in all scenarios - not all Python code is deployed/packaged at system level. For example, plug-ins (such as Django apps) are often not meant to be installed by a system-level packager. This might also be true in scenarios where Python is embedded into some other application. It's really useful to be able to co-locate packages with their data (e.g. in a zip file) and I don't think all instances of putting data files in a package are to be frowned upon. Regards, Vinay Sajip
participants (10): Martin v. Löwis, Antoine Pitrou, Barry Warsaw, Glyph, Glyph Lefkowitz, Nick Coghlan, P.J. Eby, PJ Eby, Vinay Sajip, Éric Araujo