[Import-SIG] Loading Resources From a Python Module/Package

Donald Stufft donald at stufft.io
Mon Feb 2 14:31:39 CET 2015


> On Feb 1, 2015, at 12:28 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
> On 1 February 2015 at 08:27, Brett Cannon <brett at python.org> wrote:
>> As I said above, I partially feel like the desire for this support is to
>> work around some API decisions that are somewhat poor.
>> 
>> How about this: get_path(package, path, *, real=False) or get_path(package,
>> filename, *, real=False) -- depending on whether Barry and I get our way
>> about paths or you do, Donald -- where 'real' is a flag specifying whether
>> the path has to work as a path argument to builtins.open() and thus fails
>> accordingly (in instances where it won't work it can fail immediately and so
>> loader implementers only have two lines of code to care about to manage it).
>> Then loaders can keep their get_data() method without issue and the API for
>> loaders only grew by 1 (or stays constant depending on whether we want/can
>> have it subsume get_filename() long-term).
> 
> Jumping in here, since I specifically object to the "real=<boolean
> flag>" API design concept (on the grounds that the presence of that
> kind of flag means you have two different methods trying to get out),
> this thread is already quite long and there are several different
> aspects I'd like to comment on :)
> 
> * I like the overall naming suggestion of referring to this as a
> "resources" API. That not only has precedent in pkg_resources, but is
> also the standard terminology for referring to this kind of thing in
> rich client applications. (see
> https://msdn.microsoft.com/en-us/library/windows/apps/hh465241.aspx
> for example)
> 
> * I think the PEP 302 approach of referring to resource anchors as
> "paths" is inherently confusing, especially when the most common
> anchor is __file__. As a result, I think we should refer to "resource
> anchors" and "relative paths", rather than the current approach of
> trying to create and pass around "absolute paths" (which then end up
> only working properly when packages are installed to a real
> filesystem).
> 
> * I think Donald's overview at https://bpaste.net/show/0c490aa07c07 is
> a good summary of the functionality we should aim to provide (naming
> bikesheds aside)
> 
> * I agree we should treat extraction and loading of C extension
> modules (and shared libraries in general) as out of scope for the
> resource API. They face several restrictions that don't apply to other
> pure data files
> 
> * I agree that the resource APIs should be for read-only access only.
> Images, localisation strings, application templates, those are the
> kinds of things this API is aimed at: they're an essential part of the
> application, and hence it's appropriate to bundle them with it in a
> way that still works for single-file zip archive applications, but
> they're not Python code.
> 
> * For the "must exist as a real shareable filesystem artefact, fail
> immediately if that isn't possible" API, I think we should support
> both implicit cleanup *and* explicit context managers for
> deterministic resource control. "Make this available until I'm done
> with it, regardless of where I use it" and "make this available for
> this defined region of code" are different use cases. Depending on how
> these objects are modelled in the API (more on that below), we could
> potentially drop the atexit handler in favour of suitable
> weakref.finalize() calls (which would then clean them up once the last
> reference to the resource was dropped, rather than always waiting
> until the end of the process - "keep this resource available until the
> process ends" would then be a matter of referencing it from the
> appropriate module globals or some other similarly long lived data
> structure). Leaks due to process crashes would then be cleaned up by
> normal OS tempfile management processes.
> 
> * I don't think we should couple the concept of resource anchors
> directly to package names (as discussed, it doesn't work for namespace
> packages, for example). I think we *should* be able to *look up*
> resource anchors by package name, although this may fail in some cases
> (such as namespace packages), and that the top level API should do
> that lookup implicitly (allowing package names to be passed wherever
> an anchor is expected). A module object should also be usable as its
> own anchor. I believe we should disallow the use of filesystem paths
> as resource anchors, as that breaks the intended abstraction (looking
> resources up relative to the related modules), and the API behaviour
> is clearer if strings are always assumed to be referring to
> package/module names.
> 
> * I *don't* think it's a good idea to incorporate this idea directly
> onto the existing module Loader API. Better to create a new
> "ResourceLoader" abstraction, such that we can easily provide a
> default LocationResourceLoader. Reusing module Loader instances across
> modules would still be permitted, reusing ResourceLoader instances
> *would not*. This allows the resource anchor to be specified when
> creating the resource loader, rather than on every call.
> 
> * As a consequence of the previous point, the ResourceLoader instance
> would be linked *from the module spec* (and perhaps from the module
> globals), rather than from the module loader instance. (This is how we
> would support using a module as its own anchor). Having a resource
> loader defined in the spec would be optional, making it clear that
> namespace modules (for example), don't provide a resource access API -
> if you want to store resources inside a namespace package, you need to
> create a submodule or self-contained subpackage to serve as the
> resource anchor.
> 
> * As a consequence of making a suitably configured resource loader
> available through the module spec as part of the module finding
> process, it would become possible to access module-relative resources
> *without actually loading the module itself*.
> 
> * If the import system gets a module spec where "spec.has_location" is
> set and Loader.get_data is available, but the new
> "spec.resource_loader" attribute is set to None, then it will set it
> to "LocationResourceLoader(spec.origin)", which will rely solely on
> Loader.get_data() for content access
> 
> * We'd also provide an optimised FilesystemResourceLoader for use with
> actual installed packages where the resources already exist on disk
> and don't need to be copied to memory or a temporary directory to
> provide a suitable API.
> 
> * For abstract data access at the ResourceLoader API level, I like
> "get_anchor()" (returning a suitably descriptive string such that
> "os.path.join(anchor, <relative path>)" will work with get_data() on
> the corresponding module Loader), "get_bytes(<relative path>)",
> "get_bytestream(<relative path>)" and "get_filesystem_path(<relative
> path>)". get_anchor() would be the minimum API, with default
> implementations of the other three based on Loader.get_data(), BytesIO
> and tempfile (this would involve suitable use of lazy or on-demand
> imports for the latter two, as we'd need access to these from
> importlib._bootstrap, but wouldn't want to load them on every
> interpreter startup).
> 
> * For the top-level API, I similarly favour
> importlib.resources.get_bytes(), get_bytestream() and
> get_filesystem_path(). However, I would propose that the latter be an
> object implementing a to-be-defined subset of the pathlib Path API,
> rather than a string. Resource listing, etc, would then be handled
> through the existing Path abstraction, rather than defining a new one.
> In the standard library, because we'd just be using a temporary
> directory, we could use real Path objects (although we'd need to add
> weakref support to them to implement the weakref.finalize suggestion I
> make above)
> 
>> As for importlib.resources, that can provide a higher-level API for a
>> file-like object along with some way to say whether the file must be
>> addressable on the filesystem to know if tempfile.NamedTemporaryFile() may
>> be backing the file-like object or if io.BytesIO could provide the API.
>> 
>> This gets me a clean API for loaders and importlib and gets you your real
>> file paths as needed.
> 
> Yep, as you can see above, I agree there are two APIs to be designed
> here - the high level user facing one, and the one between the import
> machinery and plugin authors.
> 
> Cheers,
> Nick.
> 
> -- 
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
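
To make the loader-level bullets above a bit more concrete, here is a rough
sketch of what a default resource loader built on Loader.get_data() might
look like. The method names come from Nick's list; everything else is an
assumption for illustration: the constructor taking the module loader
explicitly, the DictLoader stand-in, and get_filesystem_path() returning a
plain string that the caller has to clean up. None of this exists in
importlib.

```python
import io
import os
import tempfile


class DictLoader:
    """Stand-in for a module loader (zipimport, say) that exposes get_data()."""

    def __init__(self, files):
        self._files = files

    def get_data(self, path):
        try:
            return self._files[path]
        except KeyError:
            raise OSError("no such resource: " + path)


class LocationResourceLoader:
    """Hypothetical default loader: every access goes through get_data()."""

    def __init__(self, module_loader, anchor):
        self._loader = module_loader  # assumption: passed in explicitly
        self._anchor = anchor         # fixed once, not passed on every call

    def get_anchor(self):
        return self._anchor

    def get_bytes(self, relative_path):
        return self._loader.get_data(os.path.join(self._anchor, relative_path))

    def get_bytestream(self, relative_path):
        return io.BytesIO(self.get_bytes(relative_path))

    def get_filesystem_path(self, relative_path):
        # Copy the resource to a real file so it can be handed to open().
        # Cleanup is left to the caller in this sketch.
        suffix = os.path.splitext(relative_path)[1]
        fd, path = tempfile.mkstemp(suffix=suffix)
        with os.fdopen(fd, "wb") as f:
            f.write(self.get_bytes(relative_path))
        return path
```

Only get_anchor() depends on how the package is stored; the other three
methods are generic, which matches the idea of get_anchor() being the
minimum API a plugin author has to provide.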

This all sounds reasonable to me, except maybe the weakref bit. I can
imagine getting into some trouble if you do something like:

ctx = SSLContext()
ctx.load_verify_locations(cafile=str(importlib.resources.get_filesystem_path()))

Using a weakref isn’t a horrible idea though, and I wouldn’t be completely
opposed to it. It would just mean that callers have to be sure to keep a
reference to the pathlib-style object around, even if they only need the path
as a string and are going to convert the object to a str. The failure might be
confusing because the code will work just fine in the common case: if the file
is already on the file system, nothing is going to get cleaned up. The errors
would only show up if you’re using a zip import or similar. That kind of
transient error feels like somewhat of a footgun.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


