Lack of sequential decompression in the zipfile module

Though I am an avid Python programmer, I've never forayed into the area of developing Python itself, so I'm not exactly sure how all this works.
I was confused (and somewhat disturbed) to discover recently that the zipfile module offers only one-shot decompression of files, accessible only via the read() method. It is my understanding that the module will handle files of up to 4 GB in size, and the idea of decompressing 4 GB directly into memory makes me a little queasy. Other related modules (zlib, tarfile, gzip, bzip2) all offer sequential decompression, but this does not seem to be the case for zipfile (even though the underlying zlib makes it easy to do).
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before, or if anyone has ever viewed this as a problem.
I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
Thanks, Derek
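The remark that the underlying zlib makes sequential decompression easy can be illustrated with a minimal sketch. It is not Derek's patch: the helper name iter_decompressed and its max_out parameter are made up for illustration, and the demo uses a zlib-wrapped stream, whereas zip members hold raw deflate data (which would call for zlib.decompressobj(-15)).

```python
import zlib


def iter_decompressed(chunks, max_out=4096):
    """Yield decompressed data piece by piece from an iterable of compressed chunks.

    Hypothetical helper: each decompress() call is asked for at most
    max_out bytes, so the whole payload is never requested at once.
    """
    decomp = zlib.decompressobj()  # a zip member would need zlib.decompressobj(-15)
    for chunk in chunks:
        buf = chunk
        while buf:
            data = decomp.decompress(buf, max_out)
            if data:
                yield data
            # Input that could not be processed because the output limit was hit.
            buf = decomp.unconsumed_tail
    tail = decomp.flush()
    if tail:
        yield tail


# Demo: compress a repetitive payload, feed it back in 64-byte slices.
payload = b"spam and eggs\n" * 10000
compressed = zlib.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(iter_decompressed(pieces))
```

In real use the pieces would be written to a file or processed as they arrive rather than joined back together; the join here only serves to check the round trip.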

On 2/16/07, Derek Shockey <derek.shockey@gmail.com> wrote:
Though I am an avid Python programmer, I've never forayed into the area of developing Python itself, so I'm not exactly sure how all this works.
I was confused (and somewhat disturbed) to discover recently that the zipfile module offers only one-shot decompression of files, accessible only via the read() method. It is my understanding that the module will handle files of up to 4 GB in size, and the idea of decompressing 4 GB directly into memory makes me a little queasy. Other related modules (zlib, tarfile, gzip, bzip2) all offer sequential decompression, but this does not seem to be the case for zipfile (even though the underlying zlib makes it easy to do).
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before,
Not that I know of, but searching Google would better answer that question.
or if anyone has ever viewed this as a problem.
Not that I know of.
I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
The best way is to post it as a patch to the SF tracker for Python (http://sourceforge.net/patch/?group_id=5470). Then hopefully someone will eventually get to it and have a look. Just please understand that it might be a while, as it requires someone to take enough interest in your patch to put in the time and effort to make sure it is up to being included. To help your chances of getting it included, make sure you do the following:
1. Make it match PEP 7/8 style guidelines.
2. Have unit tests.
3. Have proper documentation. It is okay if it is not in LaTeX if you don't already know the language.
-Brett
Thanks, Derek

Derek Shockey schrieb:
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before, or if anyone has ever viewed this as a problem. I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
I think something like this is patch #1121142. Regards, Martin

Derek Shockey <derek.shockey <at> gmail.com> writes:
Though I am an avid Python programmer, I've never forayed into the area of developing Python itself, so I'm not exactly sure how all this works. I was confused (and somewhat disturbed) to discover recently that the zipfile module offers only one-shot decompression of files, accessible only via the read() method. It is my understanding that the module will handle files of up to 4 GB in size, and the idea of decompressing 4 GB directly into memory makes me a little queasy. Other related modules (zlib, tarfile, gzip, bzip2) all offer sequential decompression, but this does not seem to be the case for zipfile (even though the underlying zlib makes it easy to do).
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before, or if anyone has ever viewed this as a problem. I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
This is definitely a significant problem. We had to face it at work, and in the end we decided to use zipstream (http://www.doxdesk.com/software/py/zipstream.html) instead of zipfile, but of course having the functionality in the standard library would be much better.
Michele Simionato

On 2/16/07, Derek Shockey <derek.shockey@gmail.com> wrote:
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before, or if anyone has ever viewed this as a problem. I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
I ran into the same thing and made a patch a long while ago (the one Martin mentioned): https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1121142&group_id=5470 I am actually working on it this weekend; if you'd like to exchange code/test cases/whatever feel free to send me your stuff. I'll try to get a patch that works against the trunk posted today or tomorrow if you want to try it out. Cheers, Alan

On 2/17/07, Alan McIntyre <alan.mcintyre@gmail.com> wrote:
I ran into the same thing and made a patch a long while ago (the one Martin mentioned):
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1121142&group_id=5470
I am actually working on it this weekend; if you'd like to exchange code/test cases/whatever feel free to send me your stuff. I'll try to get a patch that works against the trunk posted today or tomorrow if you want to try it out.
Derek mentioned something that hadn't occurred to me: adding ZipFile.open() is helpful, but if we don't provide something analogous to TarFile.extract, anybody wanting to just extract a file from an archive to disk will have to write their own "get and write chunks from the file object" loop. Since this is likely to be a common task, it makes sense to me to provide this capability in the ZipFile class with an extract method that takes an optional path argument (defaulting to the working directory + path for file in archive). I'll add this to the patch unless somebody greatly disagrees or has a better idea.
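The "get and write chunks from the file object" loop mentioned above might look roughly like the sketch below. It is not the code from the patch: extract_member and its parameters are hypothetical names, and it leans on ZipFile.open() as found in current versions of the standard library (which also ship their own ZipFile.extract()). Using only the base name of the member is a simplification that flattens any directory structure in the archive.

```python
import os
import tempfile
import zipfile


def extract_member(archive, name, dest_dir, chunk_size=64 * 1024):
    """Copy one archive member to dest_dir in fixed-size chunks.

    Hypothetical helper for illustration only.
    """
    target = os.path.join(dest_dir, os.path.basename(name))
    with archive.open(name) as src, open(target, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of member
                break
            dst.write(chunk)
    return target


# Demo: build a small archive in a temporary directory, then extract from it.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data/big.txt", b"0123456789" * 100000)
    with zipfile.ZipFile(zip_path) as zf:
        out_path = extract_member(zf, "data/big.txt", tmp, chunk_size=4096)
    extracted_name = os.path.basename(out_path)
    extracted_size = os.path.getsize(out_path)
```

Only one chunk of decompressed data is held in memory at a time, which is the point of having such a method next to read().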

Hi Derek,
On 2/16/07, Derek Shockey <derek.shockey@gmail.com> wrote:
Though I am an avid Python programmer, I've never forayed into the area of developing Python itself, so I'm not exactly sure how all this works.
I was confused (and somewhat disturbed) to discover recently that the zipfile module offers only one-shot decompression of files, accessible only via the read() method. It is my understanding that the module will handle files of up to 4 GB in size, and the idea of decompressing 4 GB directly into memory makes me a little queasy. Other related modules (zlib, tarfile, gzip, bzip2) all offer sequential decompression, but this does not seem to be the case for zipfile (even though the underlying zlib makes it easy to do).
Not so easy, in fact, unless you open only one zip member file at a time. If you open many member files concurrently, how will the file cache work? How many seeks will you have to do if you read alternately from one member file and another? Do you have a file-like interface, or do you just read in chunks? And if you need to open more than one member file for writing in the same zip file, that is not possible at all.
Since I was writing a script to work with potentially very large zipped files, I took it upon myself to write an extract() method for zipfile, which is essentially an adaption of the read() method modeled after tarfile's extract(). I feel that this is something that should really be provided in the zipfile module to make it more usable. I'm wondering if this has been discussed before, or if anyone has ever viewed this as a problem. I can post the code I wrote as a patch, though I'm not sure if my file IO handling is as robust as it needs to be for the stdlib. I'd appreciate any insight into the issue or direction on where I might proceed from here so as to fix what I see as a significant problem.
My Google Summer of Code project was just about this, and I implemented a lot of nice features. These features include: file-like access to zip member files (which solves your problem, and also provides a real file-like interface including .read(), .readline(), etc); support for BZIP2 compression; support for removing a member file; support for encrypting/decrypting member files.
The project is hosted at sourceforge [http://ziparchive.sf.net]. You can take a look, and try it.
I'm planning to make a new and improved release, perfecting the API and doing some code refactoring. I really think that this improved version will be better than all other zip libraries in every aspect, including number of implemented features, speed/efficiency, and ease of use. I think the time I will take to do this is roughly proportional to the amount of feedback (and help) I receive, since I alone can't think of all the needs of such a library.
Also, if anyone would like to help with development, that would be great! I have some local code I'm working on, but I can commit it to an svn branch if anyone would like to see it or help.
Thanks,
-- Nilton

Nilton Volpato wrote:
If you open many member files concurrently, how will the file cache work? How many seeks will you have to do if you read alternately from one member file and another?
If the OS file system cache is doing its job properly, I don't think the seeking should be a problem.
Or, if you need to open more than one member file for writing in the same zip file, then this is not possible at all.
I don't think it would be unreasonable to just not support writing to more than one member at a time. -- Greg
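The alternating-read scenario discussed here can be tried out with the zipfile module as it exists in current Python 3. The sketch below is only an illustration under that assumption; it says nothing about how the 2007 patch or the ziparchive project handled seeking. The member names and sizes are made up.

```python
import io
import zipfile

# Build an in-memory archive with two members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"A" * 50000)
    zf.writestr("b.txt", b"B" * 50000)

# Open both members at once and read from them alternately in small chunks.
parts_a, parts_b = [], []
with zipfile.ZipFile(buf) as zf:
    with zf.open("a.txt") as fa, zf.open("b.txt") as fb:
        while True:
            chunk_a = fa.read(1000)
            chunk_b = fb.read(1000)
            if not chunk_a and not chunk_b:
                break
            parts_a.append(chunk_a)
            parts_b.append(chunk_b)

data_a = b"".join(parts_a)
data_b = b"".join(parts_b)
```

Each member handle has to reposition the shared underlying file before it reads, so the cost of interleaving depends on how cheap those seeks are, which is the point Greg makes about the OS cache.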

Nilton Volpato schrieb:
My Google Summer of Code project was just about this, and I implemented a lot of nice features. These features include: file-like access to zip member files (which solves your problem, and also provides a real file-like interface including .read(), .readline(), etc); support for BZIP2 compression; support for removing a member file; support for encrypting/decrypting member files.
Unfortunately (?), the 2.6 zipfile module will have much of that as well; please take a look. Regards, Martin
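For readers wondering what file-like access to a member looks like in the standard library, here is a small sketch against the zipfile module in current Python 3. It assumes ZipFile.open() returning a binary file-like object; it does not use the ziparchive project's API, and the member name log.txt is made up.

```python
import io
import zipfile

# Build an in-memory archive holding one small text member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("log.txt", "first line\nsecond line\nthird line\n")

with zipfile.ZipFile(buf) as zf:
    with zf.open("log.txt") as member:
        # The member object behaves like a binary file: readline() and
        # iteration both work, and data is decompressed as it is consumed.
        first = member.readline()
        rest = [line for line in member]
```

The lines come back as bytes; wrapping the member in io.TextIOWrapper is one way to get text instead.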
participants (7)
- "Martin v. Löwis"
- Alan McIntyre
- Brett Cannon
- Derek Shockey
- Greg Ewing
- Michele Simionato
- Nilton Volpato