--------------------------------------------
On Thu, 30/1/14, Paul Moore <p.f.moore@gmail.com> wrote:

Subject: Extracting C extensions from zipfiles on sys.path (Was: wheels on sys.path clarification (reboot))
To: "Vinay Sajip" <vinay_sajip@yahoo.co.uk>
Cc: "Distutils" <distutils-sig@python.org>
Date: Thursday, 30 January, 2014, 13:23
> OK. Note that this is not, in my view, an issue with wheels, but rather about zipfiles on sys.path, and (deliberate) design limitations of the module loader and zipimport implementations.[1]
Okay, I'm glad that's clarified. Otherwise, there's a danger of it being conflated with an "importing wheels is bad" viewpoint which relates even to pure-Python code.
> First of all, it is not possible to load a DLL into a process's memory [2, 3] unless it is stored as a file in the filesystem. So any attempt to import a C extension from a zipfile must, by necessity, involve extracting that DLL to the filesystem. That's where I see the problems. None is a deal-breaker on its own, but a number of niggling issues cumulatively chip away at the reliability of the concept until the end result has enough corner cases and risks to make it unacceptable (depending on your tolerance for risk - there's a definite judgement call involved).
Okay, let's work through the issues you raise.
> 1. You need to choose a location to put the extracted file. On Windows in particular, there is no guaranteed-available filesystem location that can be used without risk. Some accounts have no home directory, some (locked down) users have no permissions anywhere but very specific places, even TEMP may not be usable if there's an aggressive housekeeping routine in place - but TEMP is probably the best choice of a bad lot.
There are always going to be environments where you can't do stuff, say because of corporate lock-down policies. There is no requirement on any solution to do the impossible; merely to fail with an informative error message. There is lots of other functionality that fails in these environments, too (e.g. access to the Internet). So my view is that this should not be an obstacle to developing such functionality for environments where it can work, as long as it fails fast and informatively when it fails.
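To make the "fail fast and informatively" point concrete, here is a rough sketch of how an extraction-location chooser could behave (the function name and the WHEEL_EXTRACT_DIR override are illustrative assumptions, not anything distlib actually provides):

```python
import os
import tempfile

def choose_extraction_dir(candidates=None):
    """Pick the first writable directory for extracting C extensions.

    Hypothetical helper: tries an explicit override, then the user's
    home, then TEMP, and fails fast with an informative error instead
    of limping on in a locked-down environment.
    """
    if candidates is None:
        candidates = [
            os.environ.get('WHEEL_EXTRACT_DIR'),  # assumed override variable
            os.path.expanduser('~'),
            tempfile.gettempdir(),
        ]
    for cand in candidates:
        if not cand:
            continue
        try:
            # Probe writability by actually creating a file there.
            with tempfile.TemporaryFile(dir=cand):
                pass
            return cand
        except OSError:
            continue
    raise RuntimeError(
        'No writable location found for extracting C extensions; '
        'set WHEEL_EXTRACT_DIR to a writable directory.'
    )
```

The key design point is the error message: when every candidate fails, the user is told what went wrong and how to fix it, rather than getting a "weird" failure deep inside an import.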
> 2. There are race conditions to consider. If the extraction is not completely isolated per-process, what if 2 processes want to use different versions of the same DLL? How will these be distinguished?
Processes are isolated from each other, so that doesn't stop different processes using different versions of DLLs. Software in those DLLs needs to be designed to avoid stepping on its own toes, but that's orthogonal to whether it came from a zip or not (.NET SxS assemblies, for example - if they have files they write to, they need to not overwrite each other's stuff). Distlib covers this by placing the DLL in a location which is based on the absolute pathname of the wheel it came from. So any software which uses the exact same wheel will use the same DLL, but other software which uses a wheel with a different version (which by definition will have a different absolute path) will use a different DLL. Perhaps there are holes in this approach - if so, please point out any that you see.
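The keying scheme described above can be sketched as follows (this is an illustration of the idea, not distlib's actual implementation; the function names and cache location are assumptions):

```python
import hashlib
import os
import tempfile

def dll_cache_dir(wheel_path, cache_root=None):
    """Derive a per-wheel extraction directory.

    Illustrative sketch: the subdirectory is keyed on the wheel's
    absolute path, so two wheels at different paths - e.g. different
    versions, which by definition have different filenames - never
    share extracted DLLs, while repeated use of the exact same wheel
    reuses the same directory.
    """
    if cache_root is None:
        # Assumed default location, not distlib's real cache root.
        cache_root = os.path.join(tempfile.gettempdir(), 'dylib-cache')
    abs_path = os.path.abspath(wheel_path)
    key = hashlib.sha256(abs_path.encode('utf-8')).hexdigest()[:16]
    return os.path.join(cache_root, key)
```

Because the key is a pure function of the absolute path, two processes mounting the same wheel converge on the same directory, and two versions of a package can never collide.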
> 3. Clean-up is an issue. How will the extracted files be removed? You can't unload the DLLs from Python, and you can't delete open files in Windows. So do you simply leave the files lying around? Or do you do some sort of atexit dance to run a separate process after the Python process terminates which will do the cleanup? What happens to that process when virus checkers hold the file open? Leaving the files around is probably the most robust answer, but it's not exactly friendly.
But it's a drawback of the underlying platform, and it seems to me OK to do the best that's possible (like we do with pip updating itself on Windows). Also, it's not clear that you always want to clean up: perhaps you don't want to extract DLLs every time if they're already there (let's not go down a cache-invalidation rabbit-hole - later is definitely better than right now ;-). My view is that cleanup belongs with the application, not the library - the application developer is best placed to know what the right thing to do is for that particular application. This is currently covered in distlib by having an API which provides the root directory path for the cache; cache cleanup can be done on start-up, before any wheels are mounted. By the way, surely you've seen how much cruft accumulates in TEMP on Windows machines? It's not as if Windows users' expectations can be particularly high here ;-) I'm all for keeping things tidy, of course.
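As a sketch of what application-level start-up cleanup might look like (the function name is hypothetical; the only distlib-specific assumption is that the application can obtain the cache root directory from the library):

```python
import os
import shutil

def clear_dll_cache(cache_root):
    """Wipe the extraction cache at application start-up, before any
    wheels are mounted.

    Hypothetical helper: entries that the OS still has locked (on
    Windows an in-use DLL cannot be deleted, and virus checkers can
    hold files open) are skipped rather than treated as fatal, and
    returned so the application can log them.
    """
    if not os.path.isdir(cache_root):
        return []
    skipped = []
    for name in os.listdir(cache_root):
        path = os.path.join(cache_root, name)
        try:
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
        except OSError:
            skipped.append(path)  # locked by another process; leave for next run
    return skipped
```

Running this before any wheel is mounted sidesteps the "can't delete open files" problem for the common case, and anything that was locked simply gets picked up on a later run.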
> The only place where having a wheel rather than a general zipfile makes a difference is that a wheel *might* at some point contain metadata that allows the wheel to claim that it's "OK" to load its contents from a zipfile. But my points above are not something that the author of the C extension can address, so there's no way that I can see that an extension author can justifiably set that flag.
It's not the extension author exactly, it's the wheel packager. In a corporate environment, they might be someone in a systems integrator role. Even if they are one and the same, the assertion is that the wheel is designed to run from a zip. Beyond that, it's up to the application developer and/or systems integrator: it doesn't mean it will work in every circumstance as one would wish. Are you telling me that most Python packages on PyPI, conventionally installed, will handle gracefully an out-of-disk-space condition? Where the cause of the failure is immediately apparent rather than "weird" at first glance? I doubt it.
> Ideally, if these problems can be solved, the solution should be included in the core zipimport module so that all users can benefit. If there are still issues to iron out and
Nick and I have both given reasons why zipimport might not be best placed to pioneer this. Although you are not concerned with binary compatibility, it is a valid concern which needs addressing, and bolstering the WHEEL metadata seems the right place for such work.
> but baking the feature into wheel mount just limits your user base (and hence your audience for raising bug reports, etc) needlessly.
I'm not hung up about exactly where the functionality gets implemented, just that it's useful. It would seem better to focus on real issues (like the ones you've raised, and the ones I raised about binary compatibility) rather than debating how best to package it. If someone is interested in developing this area, they will put in the work of looking at the issues and coming up with ideas to address them, whether it's in package X or package Y. What makes you think an enhanced third-party zipimport module is suddenly going to get lots of eyeballs? The functionality in distlib as a whole is a lot more useful (this being a tiny corner of it), but there aren't too many eyeballs on that.
> implementation of zipimport, and who has kept an interested eye on how it has been used in the 11 years since its introduction - and in particular how people have tried to overcome the limitations we felt we had to impose when designing it.
I didn't know - thanks for your work on zipimport, I think it's great. Surely 11 years is long enough for that initial functionality to have bedded down? Often, getting a new feature in means working to a feature-freeze deadline where not every avenue can be explored. That's par for the course, especially where hard technical problems are to be faced. But, surely there comes a time when it's worth taking another look, and seeing if we can push the envelope further? I hope that in the above I've addressed at least in part the issues you've raised - I'm sure you'll tell me if not.
> choice to only look at pure Python files, because the platform issues around C extensions were "too hard".
Were those just the issues you raised here? Wasn't binary compatibility discussed?
> There is, I believe, code "out there" on the internet to map a DLL image into a process based purely in memory,
I would discount this: any solution has to work on multiple Windows versions and the lower level the solution, the more the risk. We're talking (in the current implementation) just about file-system operations and import_dynamic, which are fairly mature and well understood by comparison.
> [5] To be fair, this is where the wheel metadata might help in distinguishing. But consider development and testing, where repeated test runs would not typically have different versions, but the user might well want to test whether running from zip still works.
What's wrong with having the test setup code clear the DLL cache for every run? Clearly the wheel has to be rebuilt for each run, but that's not going to be a show-stopper if the tests are arranged sensibly. Anyway, thanks for taking the time to raise the issues in detail. This kind of discussion will hopefully help to move things forward.

Regards,

Vinay Sajip