[Import-SIG] PEP proposal: Per-Module Import Path

Nick Coghlan ncoghlan at gmail.com
Sat Jul 20 09:32:01 CEST 2013


On 20 July 2013 00:51, Brett Cannon <brett at python.org> wrote:
>> tricky to register the archive as a Python path entry when needed
>> without polluting the path of applications that don't need it.
>
> How is this unique to archive-based distributions compared to any other
> scenario where all distributions are blindly added to sys.path?

Other packages and modules are just *on* an already existing sys.path
entry. They're available, but unless you import them, they just sit
there on disk and don't bother you.

*.pth files in site-packages are different: *every* application run on
that Python installation (without -S) will have *new* entries added to
sys.path :(
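
For example, a hypothetical "mylib.pth" dropped into site-packages
(all names invented for illustration) does this to every Python
process on the system, whether or not it ever touches mylib:

    # mylib.pth - scanned by the site module at every startup
    /opt/mylib/src
    import mylib_bootstrap; mylib_bootstrap.init()

The path line is appended to sys.path (when the directory exists),
and lines starting with "import" are actually executed as Python
code.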

>> One of the problems with working directly from a source checkout is
>> getting the relevant source directories onto the Python path, especially
>> when you have multiple namespace package fragments spread across several
>> subdirectories of a large repository.
>>
>
> E.g., a source checkout for the coverage.py project might be stored in the
> directory ``coveragepy``, but the actual source code is stored in
> ``coveragepy/coverage``, requiring ``coveragepy`` to be on sys.path in order
> to access the package.

Yep, exactly. With the PEP, you should be able to just do "echo
coveragepy > coverage.ref" in the current directory, and the
interpreter will follow the reference when you do "import coverage"
without the coveragepy directory ever being added to sys.path.

A more complex example is the source layout for Beaker, where we have
separate Common, Client, Server, LabController and IntegrationTest
source directories that all contribute submodules to a shared "bkr"
namespace package. Getting those all on your path in a source checkout
currently means plenty of PYTHONPATH manipulation and various helper
scripts.

Because it's a namespace package, we can't do this with symlinks - a
"bkr" symlink could only reference one of the source directories, and
we have five. With ref files (and using Python 3.4+, which may happen
some day!), we'd be able to designate a recommended working directory
and put a single "bkr.ref" file under source control that included the
appropriate relative paths to make all the components available.
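
Assuming the obvious one-relative-path-per-line format (the draft
will need to pin down the details), that file could be as simple as:

    Common
    Client
    Server
    LabController
    IntegrationTest

with each listed directory then being searched for its portion of the
"bkr" namespace package.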

>> Once the path is exhausted, if no `loader` was found and the `namespace
>> portions` path is non-empty, then a `NamespaceLoader` is returned with that
>> path.
>>
>> This proposal inserts a step before step 1 for each `path entry`:
>>
>> 0. look for `<path entry>/<name>.ref`
>
>
> Why .ref? Why not .path?

I think of this proposal in general as "indirect imports", but the
notion of a "path reference" is what inspired the specific extension
(Eric just copied it from my original email).

I didn't actually consider other possible extensions (since I was
happy with .ref and it was the first one I thought of), but I think
.path in particular would be awfully confusing, since I (and others)
already refer to .pth files as "dot path" files.


>
> Everything below should come before the import changes. It's hard to follow
> what is really being proposed for the semantics without knowing e.g. whether
> .ref files can have 0 or more paths or just a single path, etc.

The proposal is indeed for what is effectively a full-blown recursive
path search, starting again from the top with sys.meta_path :)

The only new trick we should need is something to track which path
entries we have already considered so we can prevent recursive loops.

I don't know how Eric handled it in his draft implementation, but
here's the kind of thing I would do (rough code sketch after the
list):

1. Define an IndirectReference type that can be returned instead of a loader
2. FileImporter would be aware of the new type, and if it sees it
instead of a loader:

- extracts the name of the reference file (to append to __indirect__)
- extracts the subpath (for the recursive descent)
- removes any previously seen path segments (including any from the
original path)
- if there are paths left, triggers the recursive search for a loader
(I was thinking at the sys.path_hooks level, but Eric suggested using
sys.meta_path instead)
- treats the result of the recursive search the same as the result of
a search of a single path segment
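
In rough Python terms (every name here is mine and purely
illustrative - IndirectReference, the return convention, all of it;
Eric's draft implementation may look nothing like this):

    import os

    class IndirectReference:
        """What a path entry finder could hand back instead of a
        loader when it finds <name>.ref rather than the module."""
        def __init__(self, ref_file, subpaths):
            self.ref_file = ref_file  # the .ref file, for __indirect__
            self.subpaths = subpaths  # path entries listed inside it

    def _check_entry(name, entry):
        """Check one path entry for <name>.ref. A real finder would
        fall back to the normal loader search here; returning None
        instead keeps the sketch self-contained."""
        ref_file = os.path.join(entry, name + ".ref")
        if not os.path.isfile(ref_file):
            return None
        with open(ref_file) as f:
            lines = [line.strip() for line in f]
        subpaths = [os.path.join(entry, line) for line in lines if line]
        return IndirectReference(ref_file, subpaths)

    def find_loader(name, path):
        seen = set()    # path entries already considered: the loop guard
        indirect = []   # chain of ref files, destined for __indirect__
        pending = list(path)
        while pending:
            entry = pending.pop(0)
            if entry in seen:
                continue    # drop previously seen path segments
            seen.add(entry)
            found = _check_entry(name, entry)
            if isinstance(found, IndirectReference):
                indirect.append(found.ref_file)
                # recursive descent, flattened into the work queue
                # (simplified: a real version would track the chain
                # per branch rather than globally)
                pending[:0] = found.subpaths
            elif found is not None:
                return found, tuple(indirect)   # a genuine loader
        return None, tuple(indirect)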

I'm not currently sure what I would do about __indirect__ if the
result was a namespace package. In that case, you really want a
mapping from path entries to the indirections that located them (if
any). That suggests __indirect__ may want to be a mapping whenever
indirection occurs (keyed by __file__ in the simple single-loader
case), and None if no indirection was involved in loading the module.
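
Purely as illustration (speculative shapes, made-up paths):

    # namespace package: each contributing path entry maps to the
    # chain of ref files that located it
    namespace_case = {
        "/checkout/Common/bkr": ("/checkout/bkr.ref",),
        "/checkout/Server/bkr": ("/checkout/bkr.ref",),
    }

    # simple module: keyed by __file__ instead
    simple_case = {
        "/checkout/coveragepy/coverage/__init__.py":
            ("/checkout/coverage.ref",),
    }

    # and no mapping at all when no indirection was involved
    no_indirection = None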

You couldn't have the loader handle the recursion itself, or you'd
have trouble getting state from the previous loader to the next one
down. Thus the recursive->iterative transformation through the
"IndirectReference" type.

(Again, though, this is just me thinking out loud - Eric's the one
that created an actual draft implementation)


>> In order to facilitate that, modules will have a new attribute:
>> `__indirect__`.  It will be a tuple comprising the chain of ref files, in
>> order, used to locate the module's __file__.  An empty tuple or one with a
>> single item will be the most common case.  An empty tuple indicates that no ref
>> files were used to locate the module.
>
> This complicates things even further. How are you going to pass this info
> along a call chain through find_loader()? Are we going to have to add
> find_loader3() to support this (nasty side-effect of using tuples instead of
> types.SimpleNamespace for the return value)? Some magic second value or type
> from find_loader() which flags that the values in the iterable are from a .ref
> file and not any other possible place? This requires an API change and there
> isn't any mention of how that would look or work.

Good question. This was a last-minute addition just before Eric posted
the draft. I still think it's something we should try to offer, and I
suspect whatever mechanism is put in place to prevent recursive loops
should be able to handle propagating this information (as described in
my sketch above).

>> This is an undesirable side effect of the way `*.pth` processing is
>> defined, but can't be changed due to backwards compatibility issues.
>>
>> Furthermore, `*.pth` files are processed at interpreter startup...
>
>
> That's a moot point; .ref files can be as well if they are triggered as part
> of an import.

The difference is that ref files will only be triggered for modules
you actually import. *Every* .pth file in site-packages is processed
at interpreter startup, with startup time implications for all Python
code run on that system.

> A bigger concern is that they execute arbitrary Python code which could be
> viewed as an unexpected security risk. Some might complain about the
> difficulty then of loading non-standard importers, but that really should be
> the duty of the code performing the import and not the distribution itself;
> IOW I would argue that it is up to the user to get things in line to use a
> distribution in the format they choose to use it instead of the distribution
> dictating how it should be bundled.

Agreed, we should mention this (it's one of the reasons Linux distros
aren't fond of *.pth files).

We should also note that, unlike *.pth files, *.ref files would work
even with the "-S" switch, since they don't rely on the site module
making additions to sys.path.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

