Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))
Changing the subject to clearly focus the discussion. On 30 January 2014 11:57, Vinay Sajip <vinay_sajip@yahoo.co.uk> wrote:
If you have other reasons for your -1, I'd like to hear them.
OK. Note that this is not, in my view, an issue with wheels, but rather about zipfiles on sys.path, and (deliberate) design limitations of the module loader and zipimport implementations.[1]

First of all, it is not possible to load a DLL into a process's memory [2, 3] unless it is stored as a file in the filesystem. So any attempt to import a C extension from a zipfile must, by necessity, involve extracting that DLL to the filesystem. That's where I see the problems. None are deal-breaking issues on their own, but there are a number of niggling issues that cumulatively chip away at the reliability of the concept, until the end result has enough corner cases and risks to make it unacceptable (depending on your tolerance for risk - there's a definite judgement call involved). The issues I can see are: [4]

1. You need to choose a location to put the extracted file. On Windows in particular, there is no guaranteed-available filesystem location that can be used without risk. Some accounts have no home directory, some (locked down) users have no permissions anywhere but very specific places, even TEMP may not be usable if there's an aggressive housekeeping routine in place - but TEMP is probably the best choice of a bad lot.

2. There are race conditions to consider. If the extraction is not completely isolated per-process, what if 2 processes want to use different versions of the same DLL? How will these be distinguished? [5] So to avoid corner cases you have to assume only the one process uses a given extracted DLL.

3. Clean-up is an issue. How will the extracted files be removed? You can't unload the DLLs from Python, and you can't delete open files in Windows. So do you simply leave the files lying round? Or do you do some sort of atexit dance to run a separate process after the Python process terminates which will do the cleanup? What happens to that process when virus checkers hold the file open?
Leaving the files around is probably the most robust answer, but it's not exactly friendly.

As I've said elsewhere, these are fundamental issues with importing DLLs from zipfiles, and have no direct relationship to wheels. The only place where having a wheel rather than a general zipfile makes a difference is that a wheel *might* at some point contain metadata that allows the wheel to claim that it's "OK" to load its contents from a zipfile. But my points above are not something that the author of the C extension can address, so there's no way that I can see that an extension author can justifiably set that flag.

So: as wheels don't give any additional reliability over any other zipfile, I don't see this (loading C extensions) as a wheel-related feature. Ideally, if these problems can be solved, the solution should be included in the core zipimport module so that all users can benefit. If there are still issues to iron out and experience to be gained, a 3rd-party "enhanced zip importer" module would be a reasonable test-bed for the solution. A 3rd-party solution could also be appropriate if the caveats and/or limitations were generally acceptable, but sufficient to prohibit stdlib inclusion. The wheel mount API could, if you wanted, look for the existence of that enhanced zipimport module and use it when appropriate, but baking the feature into wheel mount just limits your user base (and hence your audience for raising bug reports, etc.) needlessly.

I hope this explains my reasoning in sufficient detail.

FINAL DISCLAIMER: I have no objection to this feature being provided per se, any more than I object to the existence of (say) Zope. Just because I'm not a member of the target audience doesn't mean that it's not a feature that some might benefit from.
All I'm trying to do here is offer my input as someone who was involved in the initial implementation of zipimport, and who has kept an interested eye on how it has been used in the 11 years since its introduction - and in particular how people have tried to overcome the limitations we felt we had to impose when designing it. Ultimately, I would be overjoyed if someone could find a solution to this issue (in much the same way as I'm delighted by what Brett has done with importlib).

Paul

Footnotes:

[1] Historical footnote - I was directly involved with the design of PEP 302 and the zipimport implementation, and we made a deliberate choice to only look at pure Python files, because the platform issues around C extensions were "too hard".

[2] I'm talking from a Windows perspective here. I do not have sufficient low-level knowledge of Unix to comment on that case. I suspect that the issues are similar but I defer to the platform experts.

[3] There is, I believe, code "out there" on the internet to map a DLL image into a process based purely in memory, but I think it's a fairly gross hack. I have a suspicion that someone - possibly Thomas Heller - experimented with it at one time, but never came up with a viable implementation. There's also TCL's tclkit technology, which *may* support binary extensions, and may be worth a look, but TCL has virtual filesystem support built in quite deep in the core, so how it works may not be applicable to Python.

[4] I'm suggesting answers to the questions I'm raising here. The answers *may* be wrong - I've never tried to design a robust solution to this issue - but I believe the questions are the important point here. Please don't focus on why my suggested approach is wrong - I know it is!

[5] To be fair, this is where the wheel metadata might help in distinguishing.
But consider development and testing, where repeated test runs would not typically have different versions, but the user might well want to test whether running from zip still works. So wheel metadata helps, but isn't a complete solution. And compile data is probably just as good, so let's keep assuming we are looking at a general zipimport facility.
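The unavoidable extraction step described above - copying the binary out of the archive so the OS loader can map it into the process - can be sketched with nothing but the stdlib. This is purely an illustrative sketch (the function name is invented, and a real importer would hook this into the import machinery and then call the platform's dynamic loader on the result):

```python
import os
import zipfile

def extract_extension(zip_path, member, dest_dir):
    """Copy one archive member (e.g. a .pyd/.so) out to the
    filesystem so the OS loader can map it into the process.
    Returns the path to the extracted file."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.extract(member, dest_dir)
```

Everything Paul lists as a problem - where `dest_dir` lives, which processes share it, and who deletes it afterwards - happens around this small core, not inside it.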
--------------------------------------------
On Thu, 30/1/14, Paul Moore <p.f.moore@gmail.com> wrote:

Subject: Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))
To: "Vinay Sajip" <vinay_sajip@yahoo.co.uk>
Cc: "Distutils" <distutils-sig@python.org>
Date: Thursday, 30 January, 2014, 13:23
OK. Note that this is not, in my view, an issue with wheels, but rather about zipfiles on sys.path, and (deliberate) design limitations of the module loader and zipimport implementations.[1]
Okay, I'm glad that's clarified. Otherwise, there's a danger of it being conflated with an "importing wheels is bad" viewpoint which relates even to pure-Python code.
First of all, it is not possible to load a DLL into a process' memory [2, 3] unless it is stored as a file in the filesystem. So any attempt to import a C extension from a zipfile must, by necessity, involve extracting that DLL to the filesystem. That's where I see the problems. None are deal-breaking issues, but they consist of a number of niggling issues that cumulatively chip away at the reliability of the concept until the end result has enough corner cases and risks to make it unacceptable (depending on your tolerance for risks - there's a definite judgement call involved).
Okay, let's work through the issues you raise.
1. You need to choose a location to put the extracted file. On Windows in particular, there is no guaranteed-available filesystem location that can be used without risk. Some accounts have no home directory, some (locked down) users have no permissions anywhere but very specific places, even TEMP may not be usable if there's an aggressive housekeeping routine in place - but TEMP is probably the best choice of a bad lot.
There are always going to be environments where you can't do stuff, say because of corporate lock-down policies. There is no requirement on any solution to do the impossible; merely to fail with an informative error message. There is lots of other functionality that fails in these environments, too (e.g. access to the Internet). So my view is that this should not be an obstacle to developing such functionality for environments where it can work, as long as it fails fast and informatively when it fails.
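The "fail fast and informatively" behaviour Vinay asks for can be sketched as a candidate walk over possible extraction locations. This is a sketch only - the candidate order, the directory name, and the function name are all illustrative, not any tool's actual logic:

```python
import os
import tempfile

def choose_extraction_dir(appname="myapp"):
    """Pick a writable directory for extracted extension modules,
    failing fast with an informative error if none is usable.
    (Sketch - candidate order and naming are illustrative.)"""
    candidates = [
        os.environ.get("LOCALAPPDATA"),   # per-user, on Windows
        os.path.expanduser("~"),          # home directory, if any
        tempfile.gettempdir(),            # TEMP - "best of a bad lot"
    ]
    for base in candidates:
        if not base or not os.path.isdir(base):
            continue
        target = os.path.join(base, ".%s-dll-cache" % appname)
        try:
            os.makedirs(target, exist_ok=True)
            # Verify we can actually create files there, not just list it.
            probe = tempfile.TemporaryFile(dir=target)
            probe.close()
            return target
        except OSError:
            continue
    raise RuntimeError(
        "No writable location found for extracted extensions; "
        "cannot import C extensions from a zipfile in this environment.")
```

In a fully locked-down environment every candidate fails and the user gets one clear error, rather than a "weird" failure deep inside an import.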
2. There are race conditions to consider. If the extraction is not completely isolated per-process, what if 2 processes want to use different versions of the same DLL? How will these be distinguished?
Processes are isolated from each other, so that doesn't stop different processes using different versions of DLLs. Software in those DLLs needs to be designed to avoid stepping on its own toes, but that's orthogonal to whether it came from a zip or not (.NET SxS assemblies, for example - if they have files they write to, they need to not overwrite each other's stuff). Distlib covers this by placing the DLL in a location which is based on the absolute pathname of the wheel it came from. So any software which uses the exact same wheel will use the same DLL, but other software which uses a wheel with a different version (which by definition will have a different absolute path) will use a different DLL. Perhaps there are holes in this approach - if so, please point out any that you see.
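The scheme described here - keying the extraction directory on the absolute path of the wheel, so wheels at different paths never share extracted DLLs - can be sketched as follows. The hashing scheme is illustrative; distlib's actual cache layout may differ:

```python
import hashlib
import os

def dll_cache_dir(cache_root, wheel_path):
    """Map a wheel's absolute path to a distinct cache subdirectory.
    Two wheels at different absolute paths (e.g. different versions
    installed in different places) get different directories; the
    exact same wheel always maps back to the same one."""
    abs_path = os.path.abspath(wheel_path)
    key = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_root, key)
```

The hole Paul goes on to point out is the case where the path stays the same but the wheel's contents change, as in repeated development builds.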
3. Clean-up is an issue. How will the extracted files be removed? You can't unload the DLLs from Python, and you can't delete open files in Windows. So do you simply leave the files lying round? Or do you do some sort of atexit dance to run a separate process after the Python process terminates which will do the cleanup? What happens to that process when virus checkers hold the file open? Leaving the files around is probably the most robust answer, but it's not exactly friendly.
But it's a drawback of the underlying platform, and it seems to me OK to do the best that's possible (like we do with pip updating itself on Windows). Also, it's not clear that you always want to clean up: perhaps you don't want to extract DLLs every time if they're already there (let's not go down a cache-invalidation rabbit-hole - later is definitely better than right now ;-).

My view is that cleanup belongs with the application, not the library - the application developer is best placed to know what the right thing to do is for that particular application. This is currently covered in distlib by having an API which provides the root directory path for the cache. Cache cleanup can be done on start-up, before any wheels are mounted.

By the way, surely you've seen how much cruft accumulates in TEMP on Windows machines? It's not as if Windows users' expectations can be particularly high here ;-) I'm all for keeping things tidy, of course.
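The start-up cleanup suggested here works precisely because it runs before any wheels are mounted: nothing in the cache is loaded yet, so even Windows will allow the deletes. A minimal sketch (the function name is invented; the cache root would come from whatever API the library exposes):

```python
import os
import shutil

def clear_dll_cache(cache_root):
    """Remove previously extracted files at application start-up,
    before any wheels are mounted. At this point no DLL in the
    cache can be loaded, so deletion succeeds even on Windows."""
    if os.path.isdir(cache_root):
        shutil.rmtree(cache_root, ignore_errors=True)
    os.makedirs(cache_root, exist_ok=True)
```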
The only place where having a wheel rather than a general zipfile makes a difference is that a wheel *might* at some point contain metadata that allows the wheel to claim that it's "OK" to load its contents from a zipfile. But my points above are not something that the author of the C extension can address, so there's no way that I can see that an extension author can justifiably set that flag.
It's not the extension author exactly, it's the wheel packager. In a corporate environment, they might be someone in a systems integrator role. Even if they are one and the same, the assertion is that the wheel is designed to run from a zip. Beyond that, it's up to the application developer and/or systems integrator: it doesn't mean it will work in every circumstance as one would wish. Are you telling me that most Python packages on PyPI, conventionally installed, will handle gracefully an out-of-disk-space condition? Where the cause of the failure is immediately apparent rather than "weird" at first glance? I doubt it.
Ideally, if these problems can be solved, the solution should be included in the core zipimport module so that all users can benefit. If there are still issues to iron out and
Nick and I have both given reasons why zipimport might not be best placed to pioneer this. Although you are not concerned with binary compatibility, it is a valid concern which needs addressing, and bolstering the WHEEL metadata seems the right place for such work.
but baking the feature into wheel mount just limits your user base (and hence your audience for raising bug reports, etc) needlessly.
I'm not hung up about exactly where the functionality gets implemented, just that it's useful. It would seem better to focus on real issues (like the ones you've raised, and the ones I raised about binary compatibility) rather than debating how best to package it. If someone is interested in developing this area, they will put in the work of looking at the issues and coming up with ideas to address them, whether it ends up in package X or package Y. What makes you think an enhanced third-party zipimport module is suddenly going to get lots of eyeballs? The functionality in distlib as a whole is a lot more useful (this being a tiny corner of it), but there aren't too many eyeballs on that.
implementation of zipimport, and who has kept an interested eye on how it has been used in the 11 years since its introduction - and in particular how people have tried to overcome the limitations we felt we had to impose when designing it.
I didn't know - thanks for your work on zipimport, I think it's great. Surely 11 years is long enough for that initial functionality to have bedded down? Often, getting a new feature in means working to a feature-freeze deadline where not every avenue can be explored. That's par for the course, especially where hard technical problems are to be faced. But, surely there comes a time when it's worth taking another look, and seeing if we can push the envelope further? I hope that in the above I've addressed at least in part the issues you've raised - I'm sure you'll tell me if not.
choice to only look at pure Python files, because the platform issues around C extensions were "too hard".
Were those just the issues you raised here? Wasn't binary compatibility discussed?
There is, I believe, code "out there" on the internet to map a DLL image into a process based purely in memory,
I would discount this: any solution has to work on multiple Windows versions and the lower level the solution, the more the risk. We're talking (in the current implementation) just about file-system operations and import_dynamic, which are fairly mature and well understood by comparison.
[5] To be fair, this is where the wheel metadata might help in distinguishing. But consider development and testing, where repeated test runs would not typically have different versions, but the user might well want to test whether running from zip still works.
What's wrong with having the test setup code clear the DLL cache for every run? Clearly the wheel has to be rebuilt for each run, but that's not going to be a show-stopper if the tests are arranged optimally.

Anyway, thanks for taking the time to raise the issues in detail. This kind of discussion will hopefully help to move things forward.

Regards,

Vinay Sajip
Am 30.01.2014 14:23, schrieb Paul Moore:
First of all, it is not possible to load a DLL into a process' memory [2, 3] unless it is stored as a file in the filesystem.
[...]
[2] I'm talking from a Windows perspective here. I do not have sufficient low-level knowledge of Unix to comment on that case. I suspect that the issues are similar but I defer to the platform experts.
[3] There is, I believe, code "out there" on the internet to map a DLL image into a process based purely in memory, but I think it's a fairly gross hack. I have a suspicion that someone - possibly Thomas Heller - experimented with it at one time, but never came up with a viable implementation. There's also TCL's tclkit technology, which *may* support binary extensions, and may be worth a look, but TCL has virtual filesystem support built in quite deep in the core, so how it works may not be applicable to Python.
Well, py2exe uses code that loads DLLs from zip files directly into process memory and executes them. This is used for the 'single-file' applications that py2exe creates; it is triggered by the appropriate 'bundle_files' option. It works quite well for some applications (wx apps, for example), but not for others. Applications that use numpy, for example, crash - but the reason is that py2exe *also* puts the MKL library (or maybe even more stuff, like Windows DLLs) that numpy needs into the zip. So, the trick is to only load Python extensions, and not arbitrary DLLs, from the zip. Changing the packing process so that py2exe leaves these non-Python DLLs in the file system makes the exe work again. At least now the range of apps that work reliably is much larger.

Anyway, the py2exe installer installs the Python extension and corresponding py-module that implements this magic, so anyone can experiment with it; it is called 'zipextimporter.py'. Here's the start of the docstring:

r"""zipextimporter - an importer which can import extension modules from zipfiles

This file and also _memimporter.pyd is part of the py2exe package.

Overview
========

zipextimporter.py contains the ZipExtImporter class which allows to load Python binary extension modules contained in a zip archive, without unpacking them to the file system.

Call the zipextimporter.install() function to install the import hook, add a zip-file containing .pyd or .dll extension modules to sys.path, and import them.

It uses the _memimporter extension which uses code from Joachim Bauch's MemoryModule library. This library emulates the win32 api function LoadLibrary.
"""

I wanted to post this to clarify the current state.

Thomas
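zipextimporter itself is Windows-only (it needs _memimporter), but the hook mechanism it relies on is the standard one: install a finder on sys.meta_path that serves modules from somewhere other than the filesystem. A pure-Python analogue of that mechanism - serving source from memory instead of loading DLLs, purely for illustration and not part of py2exe - looks like:

```python
import importlib.abc
import importlib.util
import sys

# Source for a fake "extension" module, held in memory instead of on disk.
FAKE_MODULES = {"fake_ext": "def answer():\n    return 42\n"}

class MemoryImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Minimal meta-path hook: the same mechanism zipextimporter uses,
    but serving Python source from a dict rather than DLLs from a zip."""

    def find_spec(self, name, path=None, target=None):
        if name in FAKE_MODULES:
            return importlib.util.spec_from_loader(name, self)
        return None

    def create_module(self, spec):
        return None  # default module creation is fine

    def exec_module(self, module):
        exec(FAKE_MODULES[module.__name__], module.__dict__)

sys.meta_path.insert(0, MemoryImporter())
import fake_ext  # resolved by MemoryImporter, never touches the filesystem
print(fake_ext.answer())  # prints 42
```

What _memimporter adds on top of this pattern is the hard part: emulating LoadLibrary so that the bytes of a .pyd pulled from the zip can be executed without ever becoming a file.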
Thomas Heller <theller <at> ctypes.org> writes:
It uses the _memimporter extension which uses code from Joachim Bauch's MemoryModule library. This library emulates the win32 api function LoadLibrary.
When this was mentioned on python-dev last March you said it was 32-bit and Python 2.x only, is that still the case? Regards, Vinay Sajip
Am 31.01.2014 09:35, schrieb Vinay Sajip:
Thomas Heller <theller <at> ctypes.org> writes:
It uses the _memimporter extension which uses code from Joachim Bauch's MemoryModule library. This library emulates the win32 api function LoadLibrary.
When this was mentioned on python-dev last March you said it was 32-bit and Python 2.x only, is that still the case?
This limitation no longer exists; it works on 2.x and 3.x, as well as on 32-bit and 64-bit - although the code is not yet released as a wheel/bdist_wininst/egg or whatever.

Thomas
On Fri, 31/1/14, Thomas Heller <theller@ctypes.org> wrote:
This limitation no longer exists; it works on 2.x and 3.x, as well as on 32-bit and 64-bit.
That's good to know. Of course, as it's platform specific, it can't be used as a generic C extension import facility, though I suppose that if/when such a capability becomes available on mainstream platforms, the memimporter API could be adapted to cover it. Regards, Vinay Sajip
On 31 Jan 2014 21:00, "Vinay Sajip" <vinay_sajip@yahoo.co.uk> wrote:
On Fri, 31/1/14, Thomas Heller <theller@ctypes.org> wrote:
This limitation no longer exists; it works on 2.x and 3.x, as well as on 32-bit and 64-bit.
That's good to know. Of course, as it's platform specific, it can't be used as a generic C extension import facility, though I suppose that if/when such a capability becomes available on mainstream platforms, the memimporter API could be adapted to cover it.
Ah, but unlinking is more reliable on most POSIX filesystems than it is on NTFS, so even using this technique on Windows and a cache elsewhere could be beneficial. Cheers, Nick.
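Nick's point rests on a POSIX behaviour that NTFS lacks: a file's directory entry can be removed while the file is still open, and the data remains accessible through existing descriptors until the last one is closed. A quick POSIX-only demonstration:

```python
import os
import tempfile

# On POSIX, unlinking an open file removes only the directory entry;
# the inode (and its data) survives until the last descriptor closes.
f = tempfile.NamedTemporaryFile(mode="w+", delete=False)
f.write("payload")
f.flush()
os.unlink(f.name)   # on Windows this fails while the handle is open
f.seek(0)
data = f.read()     # still readable: "payload"
f.close()
```

The Windows restriction is exactly why extracted DLLs cannot easily be cleaned up from within the process that loaded them, and why a cache (cleaned on the next start-up) is the pragmatic fallback there.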
Regards,
Vinay Sajip

_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
participants (4)

- Nick Coghlan
- Paul Moore
- Thomas Heller
- Vinay Sajip