zipfile refactor and AES
Hi, I've written a package that can read and write zip files encrypted with Winzip's AES encryption scheme (https://github.com/danifus/pyzipper/). It is based on Python's zipfile module which I refactored to facilitate adding the AES code as subclasses of the classes defined in the zipfile module. I would like to explore integrating some of the refactoring effort into Python if they are wanted. While the AES specfic code probably isn't suitable for integration due to its dependence on a crypto package, my hope is that the refactor of the zipfile project may be beneficial and, in particular, help development on other features of the zip spec. I'm happy to rework the changes for inclusion in Python. The general goals of the refactor were: - Keep the AES implementation changes in a separate file so changes to zipfile.py could potentially be merged into Python (the cryptically named "minimal_pep8" branch contains minimal cosmetic changes to the code unless the lines around it were changed. The master branch contains all the changes in the minimal_pep8 plus pep8 and a few other changes). - Add hooks for extending the way zipfile works to enable the addition of AES encryption without having to duplicate most of the zipfile module. This included adding hooks to: - Select and call encrypt and decrypt methods. - Read and write new "extra" data records in the central file directory and local header records. - Provide a mechanism to substitute ZipInfo, ZipExtFile and ZipWriteFile classes used in a subclass of ZipFile to ease use of subclassed ZipInfo, ZipExtFile or ZipWriteFile. This avoids having to rewrite large parts of the zipfile module if we only want to change the behaviour of a small part of one of those classes. - Contain all code that reads the header, contents and tail of a file in the archive to within ZipExtFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Contain all code that writes the header, contents and tail of a file in the archive to within ZipWriteFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Move generation of local file header and central directory record content to the ZipInfo class to be alongside the data that it is packing. - Add comments to provide context from the zip spec. Replace explicit numbers to variables with explanatory names or adding comments. Concerns: - The change set is not small. I've attempted to keep each commit in the minimal_pep8 branch focused on a single change to simplify review and kept the unrelated changes, like pep8, to a minimum. I'm happy to put in the work to get these changes into a patch for the Python project, if it is deemed useful. - This could move the internals of zipfile towards a public API (maybe it would be?) which brings additional complexity in managing future changes. - Are the hooks inline with Python's coding style? Is there a different approach to extensibility preferred within the Python project? Let me know any improvements, suggestions or concerns to this refactoring approach you may have. Happy for feedback even if it isn't about integrating the code into Python. Thanks, Dan
This sounds like a valuable refactoring to me. Is it API compatible with the current zipfile module docs? On Mon, 3 Jun 2019, 20:23 Daniel Hillier, <dhilliercode@gmail.com> wrote:
Hi,
I've written a package that can read and write zip files encrypted with Winzip's AES encryption scheme (https://github.com/danifus/pyzipper/). It is based on Python's zipfile module which I refactored to facilitate adding the AES code as subclasses of the classes defined in the zipfile module.
I would like to explore integrating some of the refactoring effort into Python if they are wanted. While the AES specfic code probably isn't suitable for integration due to its dependence on a crypto package, my hope is that the refactor of the zipfile project may be beneficial and, in particular, help development on other features of the zip spec. I'm happy to rework the changes for inclusion in Python.
The general goals of the refactor were: - Keep the AES implementation changes in a separate file so changes to zipfile.py could potentially be merged into Python (the cryptically named "minimal_pep8" branch contains minimal cosmetic changes to the code unless the lines around it were changed. The master branch contains all the changes in the minimal_pep8 plus pep8 and a few other changes). - Add hooks for extending the way zipfile works to enable the addition of AES encryption without having to duplicate most of the zipfile module. This included adding hooks to: - Select and call encrypt and decrypt methods. - Read and write new "extra" data records in the central file directory and local header records. - Provide a mechanism to substitute ZipInfo, ZipExtFile and ZipWriteFile classes used in a subclass of ZipFile to ease use of subclassed ZipInfo, ZipExtFile or ZipWriteFile. This avoids having to rewrite large parts of the zipfile module if we only want to change the behaviour of a small part of one of those classes. - Contain all code that reads the header, contents and tail of a file in the archive to within ZipExtFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Contain all code that writes the header, contents and tail of a file in the archive to within ZipWriteFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Move generation of local file header and central directory record content to the ZipInfo class to be alongside the data that it is packing. - Add comments to provide context from the zip spec. Replace explicit numbers to variables with explanatory names or adding comments.
Concerns: - The change set is not small. I've attempted to keep each commit in the minimal_pep8 branch focused on a single change to simplify review and kept the unrelated changes, like pep8, to a minimum. I'm happy to put in the work to get these changes into a patch for the Python project, if it is deemed useful. - This could move the internals of zipfile towards a public API (maybe it would be?) which brings additional complexity in managing future changes. - Are the hooks inline with Python's coding style? Is there a different approach to extensibility preferred within the Python project?
Let me know any improvements, suggestions or concerns to this refactoring approach you may have. Happy for feedback even if it isn't about integrating the code into Python.
Thanks, Dan _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
One specific pain point with zipfile is that if you zip a directory that contains the target zip file you end up trying to add the target file to itself which leads to a rapidly growing archive. From: Python-ideas <python-ideas-bounces+gadgetsteve=live.co.uk@python.org> On Behalf Of Robert Collins Sent: 03 June 2019 21:12 To: Daniel Hillier <dhilliercode@gmail.com> Cc: Python-Ideas <python-ideas@python.org> Subject: Re: [Python-ideas] zipfile refactor and AES This sounds like a valuable refactoring to me. Is it API compatible with the current zipfile module docs? On Mon, 3 Jun 2019, 20:23 Daniel Hillier, <dhilliercode@gmail.com<mailto:dhilliercode@gmail.com>> wrote: Hi, I've written a package that can read and write zip files encrypted with Winzip's AES encryption scheme (https://github.com/danifus/pyzipper/). It is based on Python's zipfile module which I refactored to facilitate adding the AES code as subclasses of the classes defined in the zipfile module. I would like to explore integrating some of the refactoring effort into Python if they are wanted. While the AES specfic code probably isn't suitable for integration due to its dependence on a crypto package, my hope is that the refactor of the zipfile project may be beneficial and, in particular, help development on other features of the zip spec. I'm happy to rework the changes for inclusion in Python. The general goals of the refactor were: - Keep the AES implementation changes in a separate file so changes to zipfile.py could potentially be merged into Python (the cryptically named "minimal_pep8" branch contains minimal cosmetic changes to the code unless the lines around it were changed. The master branch contains all the changes in the minimal_pep8 plus pep8 and a few other changes). - Add hooks for extending the way zipfile works to enable the addition of AES encryption without having to duplicate most of the zipfile module. This included adding hooks to: - Select and call encrypt and decrypt methods. - Read and write new "extra" data records in the central file directory and local header records. - Provide a mechanism to substitute ZipInfo, ZipExtFile and ZipWriteFile classes used in a subclass of ZipFile to ease use of subclassed ZipInfo, ZipExtFile or ZipWriteFile. This avoids having to rewrite large parts of the zipfile module if we only want to change the behaviour of a small part of one of those classes. - Contain all code that reads the header, contents and tail of a file in the archive to within ZipExtFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Contain all code that writes the header, contents and tail of a file in the archive to within ZipWriteFile. Previously reading the header and some other things were done in the ZipFile class before handing the rest of the processing to ZipExtFile. - Move generation of local file header and central directory record content to the ZipInfo class to be alongside the data that it is packing. - Add comments to provide context from the zip spec. Replace explicit numbers to variables with explanatory names or adding comments. Concerns: - The change set is not small. I've attempted to keep each commit in the minimal_pep8 branch focused on a single change to simplify review and kept the unrelated changes, like pep8, to a minimum. I'm happy to put in the work to get these changes into a patch for the Python project, if it is deemed useful. - This could move the internals of zipfile towards a public API (maybe it would be?) which brings additional complexity in managing future changes. - Are the hooks inline with Python's coding style? Is there a different approach to extensibility preferred within the Python project? Let me know any improvements, suggestions or concerns to this refactoring approach you may have. Happy for feedback even if it isn't about integrating the code into Python. Thanks, Dan _______________________________________________ Python-ideas mailing list Python-ideas@python.org<mailto:Python-ideas@python.org> https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Robert Collins wrote:
Is it API compatible with the current zipfile module docs?
Yes it should be (that was my intention during the refactor and if I accidentally introduced an incompatibility, I'll fix that up before I attempt to integrate the work). I had a look through the docs again and I can't see any incompatibilities. The zipfile module's test suite is currently passing against the repo too. I forked the tests into the repo to use as a basis for this work. On Tue, Jun 4, 2019 at 7:24 AM Steve Barnes <GadgetSteve@live.co.uk> wrote:
One specific pain point with zipfile is that if you zip a directory that contains the target zip file you end up trying to add the target file to itself which leads to a rapidly growing archive.
I think this could be another Python-ideas itself as it could be implemented without a refactor and there is already a lot going on in the refactor at the moment. Cheers, Dan
On Mon, Jun 03, 2019 at 09:24:03PM +0000, Steve Barnes wrote:
One specific pain point with zipfile is that if you zip a directory that contains the target zip file you end up trying to add the target file to itself which leads to a rapidly growing archive.
If that is accurate, it probably should be reported as a bug in zipfile. -- Steven
participants (4)
-
Daniel Hillier
-
Robert Collins
-
Steve Barnes
-
Steven D'Aprano