[Import-SIG] making it feasible to rely on loaders for reading intra-package data files

PJ Eby pje at telecommunity.com
Wed Feb 5 20:20:25 CET 2014


On Wed, Feb 5, 2014 at 12:13 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> On 5 Feb 2014 02:28, "Barry Warsaw" <barry at python.org> wrote:
>>
>> On Feb 05, 2014, at 01:55 AM, Nick Coghlan wrote:
>>
>> >But unfortunately, you can't even import pkg_resources to get at those
>> >without it version locking your entire sys.path.
>>
>> Which supports my point, i.e. that the stdlib should provide reasonable
>> implementations of these APIs that we can promote far and wide.  But FWIW,
>> I've never run into the pkg_resource problems you're describing.
>
> If I hadn't started working on a production RHEL application in a Fedora dev
> environment, I doubt I would have either :)
>
> Fedora hits it because we use pkg_resources dependent layouts to ship
> potentially API incompatible versions of Python packages (CherryPy2 v 3,
> modern Sphinx in EPEL, etc) that target a common system Python install.
>
> The problem is that pkg_resources assumes that either *all* packages are on
> sys.path by default or none of them are, and doesn't allow requirements to
> be supplied incrementally, so while this model *does* work, it isn't always
> pretty and can generate some rather confusing error messages.
>
> The key advantages of a new replacement package for the tasks that
> pkg_resources handles are being able to improve the handling of this
> scenario, break up the interface to better handle less-Chandler-like use
> cases in general, simplify the implementation and decouple it from
> setuptools. However, finding the roundtuits to work on it is a serious
> challenge, especially when pkg_resources isn't generally *broken*, just
> user-unfriendly in some cases. It also takes a fairly deep knowledge of both
> packaging and the import system to even attempt to tackle it, so the
> intersection between "has the required expertise" and "is interested and
> available" is currently the null set :P

I don't think Barry was advocating pkg_resources, but rather, having
*some* equally-powerful resource API available in the stdlib.

But the "resources" part of pkg_resources isn't actually that big, nor
is it strongly connected to the rest of pkg_resources.  The core of it
is just:

* ResourceManager -- the main API; implements resource_string(),
  resource_stream(), etc., by delegating to "provider" objects
* IResourceProvider -- abstract class that just documents what
  operations a resource provider has to implement
* get_provider() -- a way to find a __loader__ and look up an
  IResourceProvider implementation for it
* get_default_cache() -- a function to return the default cache base
  directory
* ExtractionError -- base exception class for resource extraction
  problems
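
To make the division of labor in the list above concrete, here is a minimal sketch. It is not the pkg_resources code: SketchManager, FilesystemProvider and register_provider_factory are hypothetical names, and the fallback registration assumes the module was imported from a plain directory.

```python
import importlib
import os

# Hypothetical registry: loader type -> factory building a provider for a module.
_provider_factories = {}


def register_provider_factory(loader_type, factory):
    """Associate a loader type with a provider factory (assumed helper)."""
    _provider_factories[loader_type] = factory


class FilesystemProvider:
    """Provider for modules that live in an ordinary directory."""

    def __init__(self, module):
        self.module_path = os.path.dirname(module.__file__)

    def get_resource_string(self, manager, resource_name):
        # Resource names use '/' regardless of platform.
        path = os.path.join(self.module_path, *resource_name.split("/"))
        with open(path, "rb") as f:
            return f.read()


def get_provider(module_name):
    """Find the module's loader and look up a provider for it."""
    module = importlib.import_module(module_name)
    loader = getattr(module, "__loader__", None)
    # Walk the loader's MRO so subclasses of a registered type also match.
    for cls in type(loader).__mro__:
        if cls in _provider_factories:
            return _provider_factories[cls](module)
    raise LookupError("no resource provider registered for %r" % (loader,))


class SketchManager:
    """Public API; every call is delegated to a provider object."""

    def resource_string(self, module_name, resource_name):
        provider = get_provider(module_name)
        return provider.get_resource_string(self, resource_name)


# Fallback: treat any loader as filesystem-backed. This is an assumption that
# only holds for modules imported from plain directories.
register_provider_factory(object, FilesystemProvider)
```

A zip-aware provider would be registered the same way, keyed on the zip importer's loader type, without the manager changing at all.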

Most of the above is ridiculously straightforward code -- a stdlib
implementation would mainly rewrite get_provider().  pkg_resources
also contains some provider classes, specifically:

* NullProvider -- an abstract base that implements IResourceProvider by
  delegation to some "virtual file system" abstract methods
* EggProvider -- handles paths relative to the parent ".egg" container
  (could be changed to do wheels)
* DefaultProvider -- standard filesystem implementation of the virtual
  file system methods
* EmptyProvider -- empty virtual filesystem (no resources)
* ZipProvider -- zipfile virtual filesystem, with some egg-specific
  (could be wheel-specific) extraction features
This is the sum total of the bits of pkg_resources relevant to such an
API.  The bits that are .egg specific (one method in EggProvider, a
few in ZipProvider) could be readily translated to wheels, for the
most part.  The get_provider() function is the *only* piece of all
this that calls into the rest of pkg_resources, and that could be
replaced with distlib calls, if anything.  (If the stdlib only
supported module-relative resources, even that wouldn't be necessary:
the API could run directly off of module names instead of
project/distribution names.)
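
For the module-relative case in that parenthetical, a minimal sketch using only the standard import machinery might look like this. The function name module_resource_bytes is hypothetical, and the code assumes the module has a __file__ and that its loader implements the optional get_data() method.

```python
import importlib
import os


def module_resource_bytes(module_name, resource_name):
    """Read a data file stored next to the named module or package."""
    module = importlib.import_module(module_name)
    loader = module.__spec__.loader
    # get_data() is an optional part of the PEP 302 loader protocol,
    # so not every loader is guaranteed to provide it.
    if not hasattr(loader, "get_data"):
        raise LookupError("loader for %r cannot read data files" % module_name)
    base = os.path.dirname(module.__file__)
    # Resource names use '/' regardless of platform.
    return loader.get_data(os.path.join(base, *resource_name.split("/")))
```

No distribution or project metadata is consulted here; the module name alone is enough to locate the data.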

It's possible that, after reviewing these classes and functions,
somebody would basically say, "screw this, I'll write my own".  Which would
actually be reasonable, because there's hardly anything to these
classes: they weigh in at maybe 600 lines (including extra blank lines
between them) in my last worked-on version, out of nearly 3000 lines
in pkg_resources.  Many of these classes are 20-40 lines --
ResourceManager and ZipProvider are the only ones that run into
hundreds of lines, and in ResourceManager's case it's because of its
extensive docstrings.  IResourceProvider is pure documentation, since
it just documents what methods ResourceManager expects to find.

pkg_resources' resource API is basically just the methods of a
ResourceManager: resource_listdir(), resource_string(), etc.  It
creates a default ResourceManager instance, and then exports its
methods as API functions.  It does it this way because it allows an
app to create its own manager with its own cache policies, cleanup,
etc., but in the default case the direct API is fine.  Few (maybe no)
apps actually make their own ResourceManager, but it gives them the
option of doing so.  (One would simply create a ResourceManager
instance, or a subclass instance, and then call its .resource_*()
methods instead of the module-level APIs.)

There: now you know almost as much about the pkg_resources resource
management architecture as I do.  ;-)

Most of what one would do to port this code to a stdlib module would
be to delete the unused bits, and replace .egg path/name/metadata
conventions with wheel-appropriate ones.

If somebody wants to take a whack at it, I'll be happy to answer
questions.  Really, this stuff is some of the *simplest* code in
pkg_resources that isn't just string parsing code.  And it's really
old, stable code, in the sense that it was among the first parts of
pkg_resources written, and least changed since then: nearly all of it
has last-change dates in 2005, with most changes since then being
minor feature additions post-Distribute-merge for better error
handling, switching away from using zipimport's file cache for zip
directory information, Python 3-support tweaks, .dist-info support,
etc.

(Which also means that there are other people who understood it well
enough to make those additions, including Jason, MvL, and Vinay.
There's also a "philip_thiem" who apparently did the
zipimport->ZipFile changeover about a year ago, and who at first
glance appears -- along with Jason -- to have pretty deeply grokked
the hairiest part of the whole thing, i.e. the zipfile extraction
code.)

