Writing importers and path hooks

On 27 March 2013 21:19, Bradley M. Froehle <brad.froehle@gmail.com> wrote:
Apologies for hijacking the thread, but it's interesting that you implemented your hook like this. I notice that you didn't use any of the importlib functionality in doing so. Was there a particular reason?
I ask because a few days ago I was writing a very similar importer, as I wanted to try a proof of concept importer based on the new importlib stuff (which is intended to make writing custom importers easier), and I really struggled to get something working. It seems to me that the importlib documentation doesn't help much for people trying to implement path hooks. But it might be just me.
Does anyone have an example of a simple importlib-based finder/loader? That would be a huge help for me. In return for any pointers, I'll look at putting together a doc patch to clarify how to use importlib to build your own path hooks :-)
Thanks, Paul

On Wed, Mar 27, 2013 at 6:59 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Struggling how? With the finder? The loader? What exactly were you trying to accomplish and how were you deviating from the standard import system?
It seems to me that the importlib documentation doesn't help much for people trying to implement path hooks.
There is a bug to clarify the docs so that they are geared more towards writing new importers instead of just documenting what's available: http://bugs.python.org/issue15867
But it might be just me. Does anyone have an example of a simple importlib-based finder/loader?
Define simple. =) I would argue importlib itself is easy enough to read.
Do you specifically mean the path hook aspect or the whole package of hook, finder, and loader?

On 28 March 2013 13:42, Brett Cannon <brett@python.org> wrote:
What I was trying to do was to write a path hook that would allow me to add a string to sys.path which contained a base-64 encoded zipfile (plus some sort of marker so it could be distinguished from a normal path entry) and as a result the contents of that embedded zip file would be available as if I'd added an actual zip file with that content to sys.path. I got in a complete mess as I tried to strip out the (essentially non-interesting) zipfile handling to get to a dummy "do nothing, everything is valid" type of example. But I don't think I would have fared much better if I'd stuck to the original full requirement.
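The first half of that idea, recognising a marked sys.path string and turning it back into a zip archive, can be sketched on its own. This is only an illustration under assumptions: the "b64zip:" marker and the helper names make_entry and open_entry are made up here, since the thread does not say which marker was used. The one firm rule it relies on is that a path hook signals "not mine" by raising ImportError.

```python
import base64
import io
import zipfile

MARKER = "b64zip:"  # hypothetical prefix; the thread does not specify one


def make_entry(files):
    """Build a sys.path string from a {filename: source text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, source in files.items():
            zf.writestr(name, source)
    return MARKER + base64.b64encode(buf.getvalue()).decode("ascii")


def open_entry(path_entry):
    """Decode a marked path entry into an in-memory ZipFile.

    Raising ImportError for anything else is what lets the import system
    move on to the next hook in sys.path_hooks.
    """
    if not isinstance(path_entry, str) or not path_entry.startswith(MARKER):
        raise ImportError("not an embedded zip entry")
    raw = base64.b64decode(path_entry[len(MARKER):])
    return zipfile.ZipFile(io.BytesIO(raw))
```

A real path hook would return a finder built around the ZipFile rather than the archive itself; this sketch stops at the decoding step.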
Thanks. I'll keep an eye on that.
:-) Fair point. I guess importlib is not *that* hard to read, but the only case that implements packages is the filesystem one, and that also deals with C extensions and other complexities that I don't have a need for. I'll try again to have a deeper look at it, but I didn't find it easy to extract the essentials when I looked before.
OK, after some more digging, it looks like I misunderstood the process somewhat. Writing a loader that inherits from *both* FileLoader and SourceLoader, and then implementing get_data (and module_repr - why do I need that, couldn't the ABC offer a default implementation?) does the job for that.
But the finder confuses me. I assume I want a PathEntryFinder and hence I should implement find_loader(). The documentation on what I need to return from there is very sparse... In the end I worked out that for a package, I need to return (MyLoader(modulename, 'foo/__init__.py'), ['foo']) (here, "foo" is my dummy marker for my example). In essence, PathEntryFinder really has to implement some form of virtual filesystem mount point, and preserve the standard filesystem semantics of modules having a filename of .../__init__.py.
So I managed to work out what was needed in the end, but it was a lot harder than I'd expected. On reflection, getting the finder semantics right (and in particular the path entry finder semantics) was the hard bit.
I'm now 100% sure that some cookbook examples would help a lot. I'll see what I can do.
Thanks, Paul

On Thu, Mar 28, 2013 at 11:38 AM, Paul Moore <p.f.moore@gmail.com> wrote:
You only need SourceLoader since you are dealing with Python source. You don't need FileLoader since you are not reading from disk but an in-memory zipfile.
and then implementing get_data (and module_repr - why do I need that, couldn't the ABC offer a default implementation?)
http://bugs.python.org/issue17093 and http://bugs.python.org/issue17566
does the job for that.
You should be implementing get_data, get_filename, and path_stats for SourceLoader.
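A minimal sketch of such a loader, assuming the source lives in a plain dict rather than a zip file. The class name InMemorySourceLoader, the "mem/" naming scheme and the load() helper are all invented for illustration. path_stats is left out on purpose: as far as I can tell it is optional, and without it SourceLoader simply skips bytecode caching.

```python
import importlib.abc
import importlib.util


class InMemorySourceLoader(importlib.abc.SourceLoader):
    """Serve source from a dict that maps pseudo-filenames to bytes."""

    def __init__(self, files):
        self._files = files

    def get_filename(self, fullname):
        # Hypothetical naming scheme: "mem/<dotted name as a path>.py".
        filename = "mem/" + fullname.replace(".", "/") + ".py"
        if filename not in self._files:
            raise ImportError("no source for " + fullname, name=fullname)
        return filename

    def get_data(self, path):
        # SourceLoader hands back whatever get_filename() returned.
        try:
            return self._files[path]
        except KeyError:
            raise OSError(path) from None


def load(name, loader):
    """Create and execute a module directly, bypassing any finder."""
    spec = importlib.util.spec_from_loader(name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module
```

Calling load("hello", InMemorySourceLoader({"mem/hello.py": b"GREETING = 'hi'\n"})) gives a module whose __file__ is the pseudo-filename. No finder is involved yet, so a plain import statement would not find it.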
But the finder confuses me. I assume I want a PathEntryFinder and hence I should implement find_loader().
Yes since you are pulling from sys.path.
The second argument should just be None: "An empty list can be used for portion to signify the loader is not part of a [namespace] package". Unfortunately a key word is missing in that sentence. http://bugs.python.org/issue17567
Well, if your zip file decided to create itself with a different file extension then it wouldn't be required, but then other people's code might break if they don't respect module abstractions (i.e. looking at __package__/__name__ or __path__ to see if something is a package).
Yep, that bit has had the least API tweaks as most people don't muck with finders but with loaders.
I'm now 100% sure that some cookbook examples would help a lot. I'll see what I can do.
I plan on writing a pure Python zip importer for Python 3.4 which should be fairly minimal and work out as a good example chunk of code. And no one need bother writing it as I'm going to do it myself regardless to make sure I plug any missing holes in the API. If you really want something to try for fun go for a sqlite3-backed setup (don't see it going in the stdlib but it would be a project to have).

On 28 March 2013 16:08, Brett Cannon <brett@python.org> wrote:
OK, cool. That helps a lot. The biggest gap here is that I don't think that anywhere has a good explanation of the required semantics of get_filename - particularly where we're not actually dealing with real filenames. My initial stab at this would be:
A module name is a dot-separated list of parts. A filename is an arbitrary token that can be used with get_data to get the module content. However, the following rules should be followed:
- Filenames should be made up of parts separated by the OS path separator.
- For packages, the final section of the filename *must* be __init__.py if the standard package detection is being used.
- The initial part of the filename needs to match your path entry if submodule lookups are going to work sanely.
In practice, you need to implement filenames as if your finder is managing a virtual filesystem mounted at your sys.path entry, with module->filename semantics being the usual subdirectory layout. And packages have a basename of __init__.py.
I'd like to know how to implement packages without the artificial __init__.py (something like a sqlite database can attach content and an "is_package" flag to the same entry). But that's advanced usage, and I can probably hack around until I work out how to do that now.
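The "virtual filesystem mounted at the sys.path entry" reading can be sketched roughly as below. Everything named here (MOUNTS, the "virtual:demo" entry, the Virtual* classes) is hypothetical, and "/" is hard-coded as the separator to keep it short. The sketch uses find_spec(), which replaced find_loader() from Python 3.4 onwards, and leans on the default package detection, so a package's __path__ becomes the "directory" part of its pseudo-filename and the hook has to accept that sub-path as well.

```python
import importlib
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "filesystem": mount point -> {relative path: source}.
MOUNTS = {
    "virtual:demo": {
        "vpkg/__init__.py": b"NAME = 'vpkg'\n",
        "vpkg/leaf.py": b"VALUE = 42\n",
    }
}


class VirtualLoader(importlib.abc.SourceLoader):
    def __init__(self, root, filename):
        self._root = root
        self._filename = filename  # always "<mount point>/<relative path>"

    def get_filename(self, fullname):
        return self._filename

    def get_data(self, path):
        relative = path[len(self._root) + 1:]
        try:
            return MOUNTS[self._root][relative]
        except KeyError:
            raise OSError(path) from None


class VirtualFinder(importlib.abc.PathEntryFinder):
    def __init__(self, root, directory):
        self._root = root            # the sys.path entry itself
        self._directory = directory  # the entry, or a package "directory" below it

    def find_spec(self, fullname, target=None):
        tail = fullname.rpartition(".")[2]
        # Try the package form first, then the plain module form.
        for suffix in ("/__init__.py", ".py"):
            filename = self._directory + "/" + tail + suffix
            relative = filename[len(self._root) + 1:]
            if relative in MOUNTS[self._root]:
                loader = VirtualLoader(self._root, filename)
                # The default is_package() sees "__init__.py" and sets __path__.
                return importlib.util.spec_from_loader(fullname, loader)
        return None


def virtual_hook(entry):
    """Accept the mount point and any package "directory" beneath it."""
    if not isinstance(entry, str) or entry.split("/", 1)[0] not in MOUNTS:
        raise ImportError("not a virtual mount")
    return VirtualFinder(entry.split("/", 1)[0], entry)


sys.path_hooks.insert(0, virtual_hook)
sys.path.append("virtual:demo")
```

After this, importing vpkg.leaf goes through the hook twice: once for "virtual:demo" to find vpkg, and once for "virtual:demo/vpkg" (the package's __path__ entry) to find the submodule. That second call is where the third rule in the list comes from.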
Ha. Yes, that makes a lot of difference :-) Did you mean None or [], by the way?
I'm not quite sure what you mean by this, but I take your point about making sure to break people's expectations as little as possible...
Hmm. I'm not sure how you can ever write a loader without needing to write an associated finder. The existing finders wouldn't return your loader, surely?
I'm pretty sure I'll write a zip importer first - it feels like one of those essential but largely useless exercises that people have to start with - a bit like scales on the piano :-) But I'd be interested in trying a sqlite importer as well. I might well see how I go with that. Thanks for the help with this. Paul

On Thu, Mar 28, 2013 at 12:33 PM, Paul Moore <p.f.moore@gmail.com> wrote:
It's because there aren't any. =) This is the first time alternative storage mechanisms are really easily viable without massive amounts of work, so no one has figured this out. The real question is how code out in the wild would react if you did something like /path/to/sqlite3:pkg.mod which is very much not a file path.
And why is that? A database doesn't need those separators as the module name would just be the primary key.
- For packages, the final section of the filename *must* be __init__.py if the standard package detection is being used.
Once again, why? A column in a database that is nothing more than a package flag would solve this as well, negating the need for this. The whole point of is_package() on loaders is to get away from this reliance on __file__ having any meaning beyond "this is the string that represents where this module's code was loaded from".
- The initial part of the filename needs to match your path entry if submodule lookups are going to work sanely
When applicable that's fine.
That's one way of doing it, but it does very much tie imports to files and it doesn't generalize the concept to places where file paths simply do not need to apply.
Define is_package(). I personally want to change the API somehow so you ask for what __path__ should be set to. Unfortunately, going down the "False means not a package, everything else means it is and what is returned should be set on __path__" route is a bit hairy and not backwards-compatible unless you require a list that always evaluates to True for packages.
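As a rough illustration of defining is_package() instead of relying on an __init__.py basename, the sketch below keeps an explicit flag next to each module's source. FlaggedLoader, the "flagged:" token format and load_flagged() are hypothetical names. The spec is built by hand with ModuleSpec so that nothing is derived from the pseudo-filename.

```python
import importlib.abc
import importlib.machinery
import importlib.util


class FlaggedLoader(importlib.abc.SourceLoader):
    """Loader whose records carry an explicit package flag."""

    def __init__(self, records):
        # records: module name -> (source bytes, is_package flag)
        self._records = records

    def get_filename(self, fullname):
        if fullname not in self._records:
            raise ImportError("unknown module " + fullname, name=fullname)
        return "flagged:" + fullname  # opaque token, not a real path

    def get_data(self, path):
        name = path.partition(":")[2]
        try:
            return self._records[name][0]
        except KeyError:
            raise OSError(path) from None

    def is_package(self, fullname):
        # Overrides the default check for a basename of __init__.py.
        if fullname not in self._records:
            raise ImportError("unknown module " + fullname, name=fullname)
        return bool(self._records[fullname][1])


def load_flagged(name, loader):
    """Build the spec by hand so the package flag, not the filename, decides."""
    spec = importlib.machinery.ModuleSpec(
        name,
        loader,
        origin=loader.get_filename(name),
        is_package=loader.is_package(name),
    )
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module
```

A module loaded this way from a record flagged as a package gets a __path__ attribute (an empty list here) even though its "filename" never mentions __init__.py. Filling that list with something a finder can use for submodule lookups is the part the finder has to decide.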
Empty list. You can check the code to see if it would work with None, but a list is expected to be used so an empty list is more consistent and still false.
To tell if a module is a package, you should do either ``if mod.__name__ == mod.__package__`` or ``if hasattr(mod, '__path__')``.
If you are not changing the storage mechanism you don't need a new finder; what importlib provides works fine. So if you are, for instance, only providing a loader which does an AST optimization pass you only need a new loader. Or if you use a DSL that you compile into Python code then you only need a new loader.
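A loader-only customisation might look roughly like this sketch, which reuses the stock SourceFileLoader for storage and only changes how source becomes code. The "optimization pass" chosen here, stripping assert statements, is just a stand-in, and StripAsserts, AssertFreeLoader and load_without_asserts are hypothetical names. Bytecode writing is switched off so the transformed code does not end up in the ordinary __pycache__; a pre-existing .pyc for the file would still be picked up unchanged, which a real implementation would need to handle.

```python
import ast
import importlib.machinery
import importlib.util


class StripAsserts(ast.NodeTransformer):
    """Replace every assert statement with pass."""

    def visit_Assert(self, node):
        return ast.copy_location(ast.Pass(), node)


class AssertFreeLoader(importlib.machinery.SourceFileLoader):
    def source_to_code(self, data, path, *args, **kwargs):
        # Extra arguments (such as the optimisation level) are ignored here.
        tree = StripAsserts().visit(ast.parse(data, filename=path))
        ast.fix_missing_locations(tree)
        return compile(tree, path, "exec", dont_inherit=True)

    def set_data(self, path, data, **kwargs):
        # Do not write transformed bytecode next to the source file.
        pass


def load_without_asserts(name, path):
    """Load a source file through the transforming loader."""
    loader = AssertFreeLoader(name, path)
    spec = importlib.util.spec_from_file_location(name, path, loader=loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module
```

No finder is written here at all; the sketch calls the loader directly on a known file path.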
The sqlite3 one is interesting as it does not whatsoever require file paths to operate; you can easily define a schema specific to source code and bytecode and really go db-specific and have the loader work from that (would also make finder lookups dead-simple). Otherwise you will end up writing a schema for a virtual filesystem which would also work but would show that people are not respecting abstractions on modules (or that the API has gaps which need filling in).
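One possible shape for the db-specific variant, keyed purely on module names, is sketched below. The table layout, the "sqlite3:" path-entry prefix and all class and function names are assumptions made for the example. It stores source only (no bytecode) and opens a new connection per lookup for simplicity. Packages get the same path entry as their __path__, so submodule lookups are just another primary-key query.

```python
import importlib
import importlib.abc
import importlib.machinery
import sqlite3
import sys

PREFIX = "sqlite3:"  # hypothetical marker for sys.path entries

SCHEMA = """
CREATE TABLE IF NOT EXISTS modules (
    name TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    is_package INTEGER NOT NULL DEFAULT 0
)
"""


def fetch(entry, fullname):
    """Return (source, is_package) for a module name, or None."""
    conn = sqlite3.connect(entry[len(PREFIX):])
    try:
        return conn.execute(
            "SELECT source, is_package FROM modules WHERE name = ?", (fullname,)
        ).fetchone()
    finally:
        conn.close()


class SqliteLoader(importlib.abc.SourceLoader):
    def __init__(self, entry):
        self._entry = entry

    def get_filename(self, fullname):
        # A token saying where the code came from; not a file path.
        return self._entry + ":" + fullname

    def get_data(self, path):
        row = fetch(self._entry, path.rpartition(":")[2])
        if row is None:
            raise OSError(path)
        return row[0].encode("utf-8")

    def is_package(self, fullname):
        row = fetch(self._entry, fullname)
        if row is None:
            raise ImportError("unknown module " + fullname, name=fullname)
        return bool(row[1])


class SqliteFinder(importlib.abc.PathEntryFinder):
    def __init__(self, entry):
        self._entry = entry

    def find_spec(self, fullname, target=None):
        row = fetch(self._entry, fullname)
        if row is None:
            return None
        loader = SqliteLoader(self._entry)
        spec = importlib.machinery.ModuleSpec(
            fullname, loader,
            origin=loader.get_filename(fullname),
            is_package=bool(row[1]),
        )
        if spec.submodule_search_locations is not None:
            # Submodules are looked up in the same database by full name.
            spec.submodule_search_locations.append(self._entry)
        return spec


def sqlite_hook(entry):
    if not isinstance(entry, str) or not entry.startswith(PREFIX):
        raise ImportError("not a sqlite3 path entry")
    return SqliteFinder(entry)
```

To try it, create the table with SCHEMA, insert a few rows, insert sqlite_hook into sys.path_hooks and append PREFIX plus the database path to sys.path.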

On Fri, Mar 29, 2013 at 3:39 AM, Brett Cannon <brett@python.org> wrote:
To tell if a module is a package, you should do either ``if mod.__name__ == mod.__package__`` or ``if hasattr(mod, '__path__')``.
The second of those is actually a bit more reliable. As with many import quirks, the answer to "But why?" is "Because __main__" :P Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
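For reference, the __path__ check fits in a tiny helper. The name is_package is made up here, and the two stdlib modules are only convenient examples of a package and a plain module.

```python
import email  # a package in the standard library
import os     # a plain module


def is_package(module):
    """Treat a module as a package if and only if it has a __path__."""
    return hasattr(module, "__path__")
```

This looks only at the module object, so it works the same whether the code came from a directory, a zip file or some other storage.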
