[Import-SIG] New draft revision for PEP 382

Fri Jul 8 21:51:39 CEST 2011

The following is my attempt at an updated draft of PEP 382, based on 
the recently-discussed changes.

To address the questions and criticisms raisd on Python-Dev when the 
PEP was introduced, I added an extended "Motivation" section that 
explains issues with the current approaches, and states the case for 
the PEP in more detail, including info about why anyone should care 
about namespace packages in the first place.  ;-)

I've also added a "Rejected Alternatives" section to document the 
other proposed approaches and the rationale for rejecting them in 
favor of the current proposal.

In addition, I've specified in a bit more detail the necessary 
changes to e.g. the pkgutil module.  (At least one open issue 
remains, however, and that is the question of what, if anything, 
should happen to the existing extend_path() function.  A second 
possible open question regards the API of the path fixup functions I 
propose in pkgutil.)

Anyway, your questions and comments, please!  The draft follows below:

PEP: 382
Title: Namespace Package Declarations
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. LÃ¶wis <martin at v.loewis.de>, PJ Eby
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-Apr-2009
Python-Version: 3.2
Post-History:

Abstract
========

This PEP proposes an enhancement to Python's import machinery to
replace existing uses of the standard library's
``pkgutil.extend_path()`` API, and similar third-party APIs such as
``pkg_resources.declare_namespace()``.

The proposed enhancement will improve the reliability of existing
namespace package implementations, while providing "One Obvious Way"
to produce and consume namespace packages.

Terminology
===========

Within this PEP, the following terms are used as follows:

Package
     Python packages as defined by Python's import statement.

Distribution
     A separately installable set of Python modules, as registered in
     the Python package index, and installed by distutils, setuptools,
     etc.

Vendor Package
     A group of files installed by an operating system's packaging
     mechanism (e.g. Debian or Redhat packages installed on Linux
     systems).

Portion
     A set of files in a single directory (possibly inside a zip file
     or other storage mechanism) that contribute modules or subpackages
     to a namespace package.  The contents of each portion ``sys.path``

Namespace Package
     A package whose subpackages and modules can be split into portions
     that can be distributed or installed separately (via separate
     distributions and/or vendor packages), in shared or separate
     installation locations.

     Unlike a regular package, however, which only allows submodule
     and subpackage imports from a single location, a namespace
     package's ``__path__`` is configured so that submodules and
     subpackages can be imported from each of its installed portions,
     regardless of their relative positions in ``sys.path``.

Motivation
==========

.. epigraph::

     "Most packages are like modules.  Their contents are highly
     interdependent and can't be pulled apart.  [However,] some
     packages exist to provide a separate namespace. ...  It should
     be possible to distribute sub-packages or submodules of these
     [namespace packages] independently."

     -- Jim Fulton, shortly before the release of Python 2.3 [1]_

The Current Approach
--------------------

First introduced in Python 2.3, namespace packages are a mechanism
for splitting a single Python package across multiple directories
on disk.  This splitting has two main benefits:

1. It allows different parts of a large package or framework to be
    distributed and installed independently.  For example, installing
    the ``zope.interface`` package without having to install every
    package in the ``zope.*`` namespace.

    (This is somewhat similar to the way Perl's package system allows
    authors to separately distribute subpackages of ``File::`` or
    ``Email::``.)

2. As a side-effect of benefit 1, it reduces package naming collisions
    across multiple authors or organizations, by encouraging them to
    use distinguishing prefixes.  Instead of say, Zope and Twisted both
    offering a top-level ``interface`` package (in which case, both
    could not be installed to the same directory), they can use
    ``zope.interface`` and ``twisted.interface``, while still being
    able to distribute these subpackages separately from other ``zope``
    or ``twisted`` subpackages.

    (This is somewhat similar to the way Java uses names like
    ``org.apache.foobar`` or ``com.sun.thingy`` to prevent collisions,
    only flatter.)

In current Python versions, however, a registration function (such as
``pkgutil.extend_path()`` or ``pkg_resources.declare_namespace()``)
must be explicitly invoked in order to set up the package's
``__path__``.

There are two problems with this approach, however.

Problems With The Current Approach
----------------------------------

The first (and lesser) problem is that there is no One Obvious Way to
either declare that a package is a "namespace" or "module" package,
or to tell which kind of package a given directory on disk is.

Instead, you must choose one of the various APIs to use, each of
which is slightly-incompatible with the others.  (For example,
``pkgutil`` supports ``*.pkg`` files; setuptools doesn't.  Likewise,
setuptools supports package portions living in zip files, and adding
new path components to already-imported namespaces, whereas
``pkgutil`` doesn't.)

Similarly, to tell whether a given directory is a "namespace" or
"module" package, you must read its documentation or inspect its code
in detail, and be able to recognize the various API calls mentioned
above.

The second -- and much larger -- issue is that whichever API is used
to declare the namespace, the declaration has to be invoked from a
namespace package's ``__init__`` module in order to work.  (Otherwise,
only the first part of the package found on ``sys.path`` would be
importable.)

This clashes with the goal of separately installing portions of a
namespace, because then each distributed piece must include a copy
of the same ``__init__.py``.  (Otherwise, each piece would not be
importable on its own, as Python currently requires the existence
of an ``__init__`` module in order to import the package at all, let
alone set up the namespace!)

In addition to the developer inconvenience of creating, synchronizing,
and distributing these duplicated ``__init__`` modules, there is a
further problem created for operating system vendors.

Vendor packages typically must not provide overlapping files, and an
attempt to install a vendor package that has a file already on disk
will fail or cause unpredictable behavior.  As vendors might choose to
package distributions such that they will end up all in a single
directory for the namespace package, all portions would contribute
conflicting ``__init__.py`` files.

This issue has lead to various fragile and complex workarounds in
practice, such as ``.pth`` file abuse by setuptools, and the shipping
of broken partial packages with distutils.

With the enhancement proposed here, however, all of the above problems
can be readily resolved.

Specification
=============

Instead of an API call buried inside a series of duplicated and
potentially-clashing ``__init__`` modules (which mostly exist only
to make the package importable and declare its namespace-ness), this
PEP proposes that Python's import machinery be modified to include
direct support for namespace packages.

This support would work by adding a new way to desginate a directory
as containing a namespace package portion: by including one or more
``*.ns`` files in it.

This approach removes the need for an ``__init__`` module to be
duplicated across namespace package portions.  Instead, each portion
can simply include a uniquely-named ``*.ns`` file, thereby avoiding
filename clashes in vendor packages.

And, since the import machinery knows that these directories are
portions of a namespace package, it can automatically initialize
the package's ``__path__`` to include portions located on different
parts of ``sys.path``.  (Thus avoiding the need for special code
to be called in the ``__init__`` module.)

In addition to doing this path setup, the import machinery will also
add any imported namespace packages to ``sys.namespace_packages``
(initially an empty set), so that namespace packages can be identified
or iterated over.

PEP \302 Extension
------------------

The existing PEP 302 protocol is to be extended to handle namespace
package portion directories, by adding a new importer method,
``namespace_subpath(fullname)``.  An implementation of this method
will be added to all applicable importer classes distributed with
Python, including those in ``pkgutil`` and ``zipimport``).

(Note: any other importer wishing to support namespace packages must
provide its own implementation of this method as well.  If an importer
does not have a ``namespace_subpath()`` method, it will be treated as
if it *did* have the method, but it returned ``None`` when called.)

This new method is called just before the importer's ``find_module()``
is normally invoked.  If the importer determines that `fullname` is
a namespace package portion under its jurisdiction, then the importer
returns an importer-specific path to that namespace portion.

For example, if a standard filesystem path importer for the path
``/usr/lib/site-packages`` is about to be asked to import ``zope``,
and there is a ``/usr/lib/site-packages/zope`` directory containing
any files ending with ``.ns``, a call to ``namespace_subpath("zope")``
on that importer should return ``"/usr/lib/site-packages/zope"``.

However, if there is no such subdirectory, or it does *not* contain
any files whose names end with ``.ns``, that importer would return
``None`` instead.

The Python import machinery will call this method on each importer
corresponding to a path entry in ``sys.path`` (for top-level imports)
or in a parent package ``__path__`` (for subpackage imports).

If a normal package or module is found before a namespace package,
importing proceeds according to the normal PEP 302 protocol.  (That
is, a loader object is simply asked to load the located module or
package.)

However, if a namespace package portion is found (i.e., an importer's
``namespace_subpath()`` returns a string), then the normal import
search stops, and a namespace package is created instead.

The import machinery continues iterating over importers and calling
``namespace_subpath()`` on them, but it does **not** continue calling
``find_module()`` on them.  Instead, it accumulates any strings
returned by the subpath calls, in order to assemble a ``__path__``
for the package being imported.

(Note that this implies that any non-namespace packages with the same
name are skipped, and not included in the resulting package's
``__path__``.  In other words, a namespace package's initial
``__path__`` only includes namespace portions, never non-namespace
package directories.)

Once this ``__path__`` has been assembled, a module is created, and
its ``__path__`` attribute is set.  The package's name is then added
to ``sys.namespace_packages`` -- a set of package names.

Finally, the ``__init__`` module code for the package (if it exists)
is located and executed in the new module's namespace.

Each importer that returns a ``namespace_subpath()`` for the package
is asked to perform a standard ``find_module()`` for the package.
Since by the normal import rules, a directory containing an
``__init__`` module is a package, this call should succeed if the
namespace package portion contains an ``__init__`` module, and the
importing can proceed normally from that point.

There is one caveat, however.  The importers currently distributed
with Python expect that *they* will be the ones to initialize the
``__path__`` attribute, which means that they must be changed to
either recognize that ``__path__`` has already been set and not
change it, or to handle namespace packages specially (e.g., via an
internal flag or checking ``sys.namespace_packages``).

Similarly, any third-party importers wishing to support namespace
packages must make similar changes.

(NOTE: in general, it goes against the design of PEP 302 for a loader
object to assume that it is always creating the module object or that
the module it is operating on is empty.  Making this assumption can
result in code that breaks the normal operation of the ``reload()``
builtin and any specialized tools that rely on it, such as lazy
importers, automatic reloaders, and so on.)

Standard Library Changes/Additions
----------------------------------

The ``pkgutil`` module should be updated to handle this
specification appropriately, including any necessary changes to
``extend_path()``, ``iter_modules()``, etc.  A new generic API for
calling ``namespace_subpath()`` on importers should be added as well.

Specifically the proposed changes and additions are:

* A new ``namespace_subpath(importer, fullname)`` generic, allowing
   implementations to be registered for existing importers.

* A new ``extend_namespaces(path_entry)`` function, to extend existing
   and already-imported namespace packages' ``__path__`` attributes to
   include any portions found in a new ``sys.path`` entry.  This
   function should be called by applications extending ``sys.path``
   at runtime, e.g. to include a plugin directory or add an egg to the
   path.

   The implementation of this function does a simple breadth-first walk
   of ``sys.namespace_packages``, and performs any necessary
   ``namespace_subpath()`` calls to identify what path entries need to
   be added to each package's ``__path__``, given that `path_entry`
   has been added to ``sys.path``.

* A new ``iter_namespaces(parent='')`` function to allow breadth-first
   traversal of namespaces in ``sys.namespace_packages``, by yielding
   the child namespace packages of `parent`.  For example, calling
   ``iter_namespaces("zope")`` might yield ``zope.app`` and
   ``zope.products`` (if they are namespace packages registered in
   ``sys.namespace_packagess``), but **not** ``zope.foo.bar``.
   This function is needed to implement ``extend_namespaces()``, but
   is potentially useful to others.

* ``ImpImporter.iter_modules()`` should be changed to also detect and
   yield the names of namespace package portions.

In addition to the above changes, the ``zipimport`` importer should
have its ``iter_modules()`` implementation similarly changed.  (Note:
current versions of Python implement this via a shim in ``pkgutil``,
so technically this is also a change to ``pkgutil``.)

Implementation Notes
--------------------

For users, developers, and distributors of namespace packages:

* ``sys.namespace_packages`` is allowed to contain non-existent or
   not-yet-imported package names; code that uses its contents should
   not assume that every name in this set is also present in
   sys.packages or that importing the name will necessarily succeed.

* ``*.ns`` files must be empty or contain only ASCII whitespace
   characters.  This leaves open the possibility for future extension
   to the format.

* Files contained within a namespace package portion directory must
   be *unique* to that portion, so that the portion can be distributed
   as a vendor package without any filename overlap.  This applies to
   modules and data files as well as ``*.ns`` files.

   (For ``*.ns`` files themselves, uniqueness can be achieved simply by
   giving them a name based on the distribution that contains the file,
   and it is recommended that packaging tools support doing this
   automatically.)

* Although this PEP supports the use of non-empty ``__init__`` modules
   in namespace packages, their usage is controversial.  If more than
   one package portion contains an ``__init__`` module, at most one of
   them will be executed, possibly leading to silent errors.

   Therefore, if you must include an ``__init__`` module in your
   namespace package, make sure that it is provided by exactly **one**
   distribution, and that all other distributions using that module's
   contents are defined so as to have an installation dependency on
   the distribution containing the ``__init__`` module.  Otherwise,
   it may not be present in some installations.

   (Note: for historical reasons, existing namespace packages nearly
   always include ``__init__`` modules, but they are usually empty
   except for code to declare the package a namespace.  Under this
   proposal, these nearly-empty modules could and should be replaced
   by an empty ``*.ns`` file in the package directory.)

For those implementing PEP 302 importer objects:

* Importers that support the ``iter_modules()`` method and want to add
   namespace support should modify their ``iter_modules()``
   method so that it discovers and list namespace packages as well as
   standard modules and packages.

* For implementation efficiency, an importer is allowed to cache
   information (such as whether a directory exists and whether an
   ``__init__`` module is present in it) between the invocation of a
   ``namespace_subpath()`` call and a subsequent ``find_module()`` call
   for the same name.

   It should, however, avoid retaining such cached information for any
   longer than the next method call, and it should also verify that the
   request is in fact for the same module/package name, as it is not
   guaranteed that a ``namespace_subpath()`` call will always be
   followed by a matching ``find_module()`` call.  (After all, an
   ``__init__`` module may already have been supplied by an earlier
   importer on the path.)

* "Meta" importers (i.e., importers placed on ``sys.meta_path``) do
   not need to implement ``namespace_subpath()``, because the method
   is only called on importers corresponding to ``sys.path`` entries.'
   If a meta importer wishes to support namespace packages, it must
   do so entirely within its ``find_module()`` implementation.

   Unfortunately, it is unlikely that any such implementation will be
   able to merge its namespace portions with those of other meta
   importers or ``sys.path`` importers, so the meaning of "supporting
   namespace packages" for a meta importer is currently undefined.

   However, since the intended use case for meta importers is to
   replace Python's normal import process entirely for some subset of
   modules, and the number of such importers currently implemented is
   quite small, this seems unlikely to be a big issue in practice.

Rejected Alternatives
=====================

* The original version of this PEP used ``.pkg`` or ``.pth`` files
   that contained either explicit directories to be added to a
   package's ``__path__``, or ``*`` to indicate that a package was
   a namespace.

   But this approach required a more complex change to the importer
   protocol, the files had to actually be opened and read, and there
   were no concrete use cases proposed for the additional flexibility
   specifying explicit paths.

* On Python-Dev, M.A. Lemburg proposed [2]_ that instead of using
   extra files, namespace packages use a ``__pkg__.py`` file to
   indicate their namespace-ness, in addition to a (required)
   ``__init__.py``.

   Unfortunately, this approach solves only one of the `problems with
   the current approach`_: i.e., having a standard way of declaring and
   identifying namespace packages.  It does not address the necessity
   of distributing duplicated files, or filename overlap between
   distributions.  Further, it does not allow truly-independent
   namespace portions to exist, since it requires a "defining" portion
   (the portion containing the single ``__init__`` module) to exist.

* Another approach considered during revisions to this PEP was to
   simply rename package directories to add a suffix like ``.ns``
   or ``-ns``, to indicate their namespaced nature.  This would effect
   a small performance improvement for the initial import of a
   namespace package, avoid the need to create empty ``*.ns`` files,
   and even make it clearer that the directory involved is a namespace
   portion.

   The downsides, however, are also plentiful.  If a package starts
   its life as a normal package, it must be renamed when it becomes
   a namespace, with the implied consequences for revision control
   tools.

   Further, there is an immense body of existing code (including the
   distutils and many other packaging tools) that expect a package
   directory's name to be the same as the package name.  And porting
   existing Python 2.x namespace packages to Python 3 would require
   widespread directory renaming as well.

   In short, this approach would require a vastly larger number of
   changes to both the standard library and third-party code, for
   a tiny potential performance improvement and a small increase in
   clarity.  It was therefore rejected on "practicality vs. purity"
   grounds.

References
==========

.. [1] "namespace" vs "module" packages (mailing list thread)
    (http://mail.zope.org/pipermail/zope3-dev/2002-December/004251.html)

.. [2] "PEP \382: Namespace Packages" (mailing list thread)
    (http://mail.python.org/pipermail/python-dev/2009-April/088087.html)

Copyright
=========

This document has been placed in the public domain.

..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    coding: utf-8
    End: