--------------------------------------------
On Thu, 30/1/14, Paul Moore <p.f.moore@gmail.com> wrote:

Subject: Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))
To: "Vinay Sajip" <vinay_sajip@yahoo.co.uk>
Cc: "Distutils" <distutils-sig@python.org>
Date: Thursday, 30 January, 2014, 13:23
> OK. Note that this is not, in my view, an issue with wheels, but rather about zipfiles on sys.path, and (deliberate) design limitations of the module loader and zipimport implementations.[1]
Okay, I'm glad that's clarified. Otherwise, there's a danger of it being conflated with an "importing wheels is bad" viewpoint which relates even to pure-Python code.
> First of all, it is not possible to load a DLL into a process's memory [2, 3] unless it is stored as a file in the filesystem. So any attempt to import a C extension from a zipfile must, by necessity, involve extracting that DLL to the filesystem. That's where I see the problems. None is a deal-breaker on its own, but a number of niggling issues cumulatively chip away at the reliability of the concept until the end result has enough corner cases and risks to make it unacceptable (depending on your tolerance for risk - there's a definite judgement call involved).
Okay, let's work through the issues you raise.
> 1. You need to choose a location to put the extracted file. On Windows in particular, there is no guaranteed-available filesystem location that can be used without risk. Some accounts have no home directory, some (locked down) users have no permissions anywhere but very specific places, even TEMP may not be usable if there's an aggressive housekeeping routine in place - but TEMP is probably the best choice of a bad lot.
There are always going to be environments where you can't do stuff, say because of corporate lock-down policies. There is no requirement on any solution to do the impossible; merely to fail with an informative error message. There is lots of other functionality that fails in these environments, too (e.g. access to the Internet). So my view is that this should not be an obstacle to developing such functionality for environments where it can work, as long as it fails fast and informatively when it fails.
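To make the "fail fast and informatively" point concrete, here is a rough sketch of how an extraction-location chooser could behave (the function name and the WHEEL_EXTRACT_DIR override are illustrative assumptions, not anything distlib actually provides):

```python
import os
import tempfile

def choose_extraction_dir(candidates=None):
    """Pick the first writable directory for extracting C extensions.

    Hypothetical helper: tries an explicit override, then the user's
    home, then TEMP, and fails fast with an informative error instead
    of limping on in a locked-down environment.
    """
    if candidates is None:
        candidates = [
            os.environ.get('WHEEL_EXTRACT_DIR'),  # assumed override variable
            os.path.expanduser('~'),
            tempfile.gettempdir(),
        ]
    for cand in candidates:
        if not cand:
            continue
        try:
            # Probe writability by actually creating a file there.
            with tempfile.TemporaryFile(dir=cand):
                pass
            return cand
        except OSError:
            continue
    raise RuntimeError(
        'No writable location found for extracting C extensions; '
        'set WHEEL_EXTRACT_DIR to a writable directory.'
    )
```

The key design point is the error message: when every candidate fails, the user is told what went wrong and how to fix it, rather than getting a "weird" failure deep inside an import.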
> 2. There are race conditions to consider. If the extraction is not completely isolated per-process, what if 2 processes want to use different versions of the same DLL? How will these be distinguished?
Processes are isolated from each other, so that doesn't stop different processes using different versions of DLLs. Software in those DLLs needs to be designed to avoid stepping on its own toes, but that's orthogonal to whether it came from a zip or not (.NET SxS assemblies, for example - if they have files they write to, they need to not overwrite each other's stuff). Distlib covers this by placing the DLL in a location which is based on the absolute pathname of the wheel it came from. So any software which uses the exact same wheel will use the same DLL, but other software which uses a wheel with a different version (which by definition will have a different absolute path) will use a different DLL. Perhaps there are holes in this approach - if so, please point out any that you see.
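The keying scheme described above can be sketched as follows (this is an illustration of the idea, not distlib's actual implementation; the function names and cache location are assumptions):

```python
import hashlib
import os
import tempfile

def dll_cache_dir(wheel_path, cache_root=None):
    """Derive a per-wheel extraction directory.

    Illustrative sketch: the subdirectory is keyed on the wheel's
    absolute path, so two wheels at different paths - e.g. different
    versions, which by definition have different filenames - never
    share extracted DLLs, while repeated use of the exact same wheel
    reuses the same directory.
    """
    if cache_root is None:
        # Assumed default location, not distlib's real cache root.
        cache_root = os.path.join(tempfile.gettempdir(), 'dylib-cache')
    abs_path = os.path.abspath(wheel_path)
    key = hashlib.sha256(abs_path.encode('utf-8')).hexdigest()[:16]
    return os.path.join(cache_root, key)
```

Because the key is a pure function of the absolute path, two processes mounting the same wheel converge on the same directory, and two versions of a package can never collide.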
> 3. Clean-up is an issue. How will the extracted files be removed? You can't unload the DLLs from Python, and you can't delete open files in Windows. So do you simply leave the files lying around? Or do you do some sort of atexit dance to run a separate process after the Python process terminates which will do the cleanup? What happens to that process when virus checkers hold the file open? Leaving the files around is probably the most robust answer, but it's not exactly friendly.
But it's a drawback of the underlying platform, and it seems to me OK to do the best that's possible (like we do with pip updating itself on Windows). Also, it's not clear that you always want to clean up: perhaps you don't want to extract DLLs every time if they're already there (let's not go down a cache-invalidation rabbit-hole - later is definitely better than right now ;-). My view is that cleanup belongs with the application, not the library - the application developer is best placed to know what the right thing to do is for that particular application. This is currently covered in distlib by having an API which provides the root directory path for the cache; cache cleanup can be done on start-up, before any wheels are mounted. By the way, surely you've seen how much cruft accumulates in TEMP on Windows machines? It's not as if Windows users' expectations can be particularly high here ;-) I'm all for keeping things tidy, of course.
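As a sketch of what application-level start-up cleanup might look like (the function name is hypothetical; the only distlib-specific assumption is that the application can obtain the cache root directory from the library):

```python
import os
import shutil

def clear_dll_cache(cache_root):
    """Wipe the extraction cache at application start-up, before any
    wheels are mounted.

    Hypothetical helper: entries that the OS still has locked (on
    Windows an in-use DLL cannot be deleted, and virus checkers can
    hold files open) are skipped rather than treated as fatal, and
    returned so the application can log them.
    """
    if not os.path.isdir(cache_root):
        return []
    skipped = []
    for name in os.listdir(cache_root):
        path = os.path.join(cache_root, name)
        try:
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        except OSError:
            skipped.append(path)  # locked by another process; leave for next run
    return skipped
```

Running this before any wheel is mounted sidesteps the "can't delete open files" problem for the common case, and anything that was locked simply gets picked up on a later run.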
> The only place where having a wheel rather than a general zipfile makes a difference is that a wheel *might* at some point contain metadata that allows the wheel to claim that it's "OK" to load its contents from a zipfile. But my points above are not something that the author of the C extension can address, so there's no way that I can see that an extension author can justifiably set that flag.
It's not the extension author exactly, it's the wheel packager. In a corporate environment, they might be someone in a systems integrator role. Even if they are one and the same, the assertion is that the wheel is designed to run from a zip. Beyond that, it's up to the application developer and/or systems integrator: it doesn't mean it will work in every circumstance as one would wish. Are you telling me that most Python packages on PyPI, conventionally installed, will handle gracefully an out-of-disk-space condition? Where the cause of the failure is immediately apparent rather than "weird" at first glance? I doubt it.
> Ideally, if these problems can be solved, the solution should be included in the core zipimport module so that all users can benefit. If there are still issues to iron out and
Nick and I have both given reasons why zipimport might not be best placed to pioneer this. Although you are not concerned with binary compatibility, it is a valid concern which needs addressing, and bolstering the WHEEL metadata seems the right place for such work.
> but baking the feature into wheel mount just limits your user base (and hence your audience for raising bug reports, etc) needlessly.
I'm not hung up about exactly where the functionality gets implemented, just that it's useful. It would seem better to focus on real issues (like the ones you've raised, and the ones I raised about binary compatibility) rather than debating how best to package it. If someone is interested in developing this area, they will put in the work of looking at the issues and coming up with ideas to address them, whether it's in package X or package Y. What makes you think an enhanced third-party zipimport module is suddenly going to get lots of eyeballs? The functionality in distlib as a whole is a lot more useful (this being a tiny corner of it), but there aren't too many eyeballs on that.
> implementation of zipimport, and who has kept an interested eye on how it has been used in the 11 years since its introduction - and in particular how people have tried to overcome the limitations we felt we had to impose when designing it.
I didn't know - thanks for your work on zipimport, I think it's great. Surely 11 years is long enough for that initial functionality to have bedded down? Often, getting a new feature in means working to a feature-freeze deadline where not every avenue can be explored. That's par for the course, especially where hard technical problems are to be faced. But, surely there comes a time when it's worth taking another look, and seeing if we can push the envelope further? I hope that in the above I've addressed at least in part the issues you've raised - I'm sure you'll tell me if not.
> choice to only look at pure Python files, because the platform issues around C extensions were "too hard".
Were those just the issues you raised here? Wasn't binary compatibility discussed?
> There is, I believe, code "out there" on the internet to map a DLL image into a process based purely in memory,
I would discount this: any solution has to work on multiple Windows versions and the lower level the solution, the more the risk. We're talking (in the current implementation) just about file-system operations and import_dynamic, which are fairly mature and well understood by comparison.
> [5] To be fair, this is where the wheel metadata might help in distinguishing. But consider development and testing, where repeated test runs would not typically have different versions, but the user might well want to test whether running from zip still works.
What's wrong with having the test setup code clear the DLL cache for every run? Clearly the wheel has to be rebuilt for each run, but that's not going to be a show-stopper if the tests are arranged sensibly. Anyway, thanks for taking the time to raise the issues in detail. This kind of discussion will hopefully help to move things forward.

Regards,

Vinay Sajip