zipfile still has 2GB boundary bug
The "2GB bug" that was supposed to be fixed in http://python.org/sf/679953 was not actually fixed. The zipinfo offsets in the structures are still signed longs, so the fix allows you to write one file that extends past the 2GB boundary, but if any further entries begin past that point you are screwed. I have opened a new bug and patch that should fix this issue: http://python.org/sf/1189216. This is a backport candidate to 2.4.2 and 2.3.6 (if that ever happens).

On a related note, if anyone else has a bunch of really big and ostensibly broken zip archives created by dumb versions of the zipfile module, I have written a script that can rebuild the central directory in-place. Ping me off-list if you're interested and I'll clean it up.

Someone should think about rewriting the zipfile module to be less hideous, include a repair feature, and be up to date with the latest specifications http://www.pkware.com/company/standards/appnote/.

Additionally, it'd also be useful if someone were to include support for Apple's "extensions" to the zip format (the __MACOSX folder and its contents) that show up when BOM (private framework) is used to create archives (i.e. Finder in Mac OS X 10.3+). I'm not sure if these are documented anywhere, but I can help with reverse engineering if someone is interested in writing the code.

On that note, Mac OS X 10.4 (Tiger) is supposed to have new APIs (or changes to existing APIs?) to facilitate resource fork preservation, ACLs, and Spotlight hooks in tar, cp, mv, etc. Someone should spend some time looking at the Darwin 8 sources for these tools (when they're publicly available in the next few weeks) to see what would need to be done in Python to support them in the standard library (the os, tarfile, etc. modules). -bob
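A minimal sketch of the boundary described above, assuming the offsets are packed as 4-byte little-endian fields with the struct module (a later message in this thread attributes the bug to the case of a struct format character). The helper name `pack_offset` is hypothetical and is not taken from zipfile or from either patch.

```python
import struct

SIGNED_MAX = 2**31 - 1  # largest value a signed 4-byte field can hold


def pack_offset(offset, fmt):
    """Pack an offset with the given struct format; return None if it does not fit."""
    try:
        return struct.pack(fmt, offset)
    except struct.error:
        return None


below = SIGNED_MAX      # last offset before the 2GB boundary
above = SIGNED_MAX + 1  # first offset past the 2GB boundary

signed_below = pack_offset(below, "<l")    # lowercase l: signed, fits
signed_above = pack_offset(above, "<l")    # lowercase l: signed, overflows
unsigned_above = pack_offset(above, "<L")  # uppercase L: unsigned, fits
```

With the unsigned format the field still tops out at 4GB minus one byte; going beyond that is what the ZIP64 extensions in the PKWARE appnote are for.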
Someone should think about rewriting the zipfile module to be less hideous, include a repair feature, and be up to date with the latest specifications http://www.pkware.com/company/standards/appnote/.
-- and allow *deleting* a file from a zipfile. As far as I can tell, you now can't (except by rewriting everything but that to a new zipfile and renaming). Somewhere I saw a patch request for this, but it was languishing, a year or more old. Or am I just totally missing something? Charles Hartman
On Apr 25, 2005, at 7:53 AM, Charles Hartman wrote:
Someone should think about rewriting the zipfile module to be less hideous, include a repair feature, and be up to date with the latest specifications http://www.pkware.com/company/standards/appnote/.
-- and allow *deleting* a file from a zipfile. As far as I can tell, you now can't (except by rewriting everything but that to a new zipfile and renaming). Somewhere I saw a patch request for this, but it was languishing, a year or more old. Or am I just totally missing something?
No, you're not missing anything. Deleting is hard, I guess. Either you'd have to shuffle the zip file around to reclaim the space, or just leave that spot alone and just remove its entry in the central directory. You'd probably want to look at what other software does to decide which approach to use (by default?). I don't see any markers in the format that would otherwise let you say "this file was deleted". -bob
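For reference, the workaround Charles describes (rewrite everything but the unwanted entry to a new zipfile, then rename) might look roughly like this with the current zipfile API. This is only a sketch: `delete_member` is a hypothetical helper, not part of the module, and it reads each entry fully into memory, so it inherits the memory cost discussed elsewhere in this thread.

```python
import os
import tempfile
import zipfile


def delete_member(archive_path, name_to_drop):
    """Rewrite archive_path without name_to_drop, then swap the new file in."""
    directory = os.path.dirname(os.path.abspath(archive_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(archive_path, "r") as src, \
             zipfile.ZipFile(tmp_path, "w") as dst:
            for info in src.infolist():
                if info.filename != name_to_drop:
                    # Passing the ZipInfo keeps the name, date and compression type.
                    dst.writestr(info, src.read(info))
        # Replace the original only after the new archive is complete.
        os.replace(tmp_path, archive_path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

The other approach mentioned above, dropping only the central directory entry and leaving the data in place, would avoid the copy but cannot be done through the public API.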
Someone should think about rewriting the zipfile module to be less hideous, include a repair feature, and be up to date with the latest specifications http://www.pkware.com/company/standards/appnote/.
-- and allow *deleting* a file from a zipfile. As far as I can tell, you now can't (except by rewriting everything but that to a new zipfile and renaming). Somewhere I saw a patch request for this, but it was languishing, a year or more old. Or am I just totally missing something?
Please don't propose a grand rewrite (even if it's only a single module). Given that the API is mostly sensible, please propose gradual refactoring of the implementation, perhaps some new API methods, and so on. Don't throw away the work that went into making it work in the first place! http://www.joelonsoftware.com/articles/fog0000000069.html -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Apr 26, 2005, at 8:24 PM, Guido van Rossum wrote:
Someone should think about rewriting the zipfile module to be less hideous, include a repair feature, and be up to date with the latest specifications http://www.pkware.com/company/standards/appnote/.
-- and allow *deleting* a file from a zipfile. As far as I can tell, you now can't (except by rewriting everything but that to a new zipfile and renaming). Somewhere I saw a patch request for this, but it was languishing, a year or more old. Or am I just totally missing something?
Please don't propose a grand rewrite (even if it's only a single module). Given that the API is mostly sensible, please propose gradual refactoring of the implementation, perhaps some new API methods, and so on. Don't throw away the work that went into making it work in the first place!
Well, I didn't necessarily mean it should be thrown away and started from scratch -- however, once you get all the ugly out of it, there's not much left! Obviously there's something wrong with the way it's written if it took years and *several passes* to correctly identify and fix a simple format character case bug. Most of this can be blamed on the struct module, which is more obscure and error-prone than writing the same code in C.

One of the most useful things that could happen to the zipfile module would be a stream interface for both reading and writing. Right now it's slow and memory hungry when dealing with large chunks. The use case that led me to fix this bug is a tool that archives video to zip files of targa sequences with a reference QuickTime movie, so I end up with thousands of bite-sized chunks.

This >2GB bug really caused me some grief in that I didn't test with such large sequences because I didn't have any. I didn't end up finding out about it until months later because the client *ignored* the exceptions raised by the GUI and came back to me with broken zip files. Fortunately the TOC in a zip file can be reconstructed from an otherwise pristine stream. Of course, I had to rewrite half of the zipfile module to come up with such a recovery program, because it's not designed well enough to let me build such a tool on top of it.

Another "bug" I ran into was that it has some crazy default for the ZipInfo record: it assumes the platform ("create_system") is Windows regardless of where you are! This caused some really subtle and annoying issues with some unzip tools (of course, on everyone's machines except mine). Fortunately someone was able to figure out why and send me a patch, but it was completely unexpected and I didn't see such craziness documented anywhere. If it weren't for this patch, it'd either still be broken, or I'd have switched to some other way of creating archives!

The zipfile module is good enough to create input files for zipimport, which is well tested and generally works -- barring the fact that zipimport has quite a few rough edges of its own. I certainly wouldn't recommend it for any heavy duty tasks in its current state. -bob
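To illustrate why the TOC can be reconstructed from an otherwise intact stream: each entry is preceded by a local file header that repeats the name and sizes, so the entries can be walked front to back. The following is only a sketch under stated assumptions and is not the recovery script mentioned above. `scan_local_headers` is a hypothetical name, the whole archive is assumed to fit in memory, and entries written with a data descriptor (flag bit 3) or with ZIP64 sizes are not handled.

```python
import struct

# Fixed 30-byte part of a local file header: signature, five 2-byte fields,
# three 4-byte fields (CRC, compressed size, uncompressed size), two lengths.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
LOCAL_SIGNATURE = b"PK\x03\x04"


def scan_local_headers(data):
    """Yield (offset, filename, compressed_size) for each local header, in order."""
    pos = 0
    while pos + LOCAL_HEADER.size <= len(data):
        (sig, _version, flags, _method, _time, _date,
         _crc, comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        if sig != LOCAL_SIGNATURE:
            break  # reached the central directory, or damaged data
        if flags & 0x08:
            raise ValueError("data descriptor in use; sizes are not in the local header")
        name_start = pos + LOCAL_HEADER.size
        name = data[name_start:name_start + name_len].decode("cp437")
        yield pos, name, comp_size
        # Skip the name, the extra field and the compressed data to the next header.
        pos = name_start + name_len + extra_len + comp_size
```

The offsets and names collected this way are the information a rebuilt central directory needs for each entry.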
Bob Ippolito wrote:
One of the most useful things that could happen to the zipfile module would be a stream interface for both reading and writing. Right now it's slow and memory hungry when dealing with large chunks. The use case that led me to fix this bug is a tool that archives video to zip files of targa sequences with a reference QuickTime movie, so I end up with thousands of bite-sized chunks.
While it's probably not an improvement on the order of magnitude you're looking for, there's a patch (1121142) that lets you read large items out of a zip archive via a file-like object. I'm occasionally running into the 2GB problem myself, so if any changes are made to get around that I can at least help out by testing it against some "real-life" data sets. Alan
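As a rough illustration of reading a large item through a file-like object: current Python versions provide `ZipFile.open`, which returns one. Whether that matches the interface proposed in patch 1121142 is an assumption here, and `extract_streaming` is a hypothetical helper name.

```python
import shutil
import zipfile


def extract_streaming(archive_path, member, dest_path, chunk_size=1024 * 1024):
    """Copy one member to dest_path in fixed-size chunks instead of one big string."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(member) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
```

Memory use is then bounded by the chunk size rather than by the size of the archived item.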
Bob Ippolito wrote:
The zipfile module is good enough to create input files for zipimport, which is well tested and generally works -- barring the fact that zipimport has quite a few rough edges of its own. I certainly wouldn't recommend it for any heavy duty tasks in its current state.
That's interesting because Java seems to suffer from similar problems. In the early days of Java, although a jar file was a zip file, Java wouldn't read jar files created by the standard zip utilities I used. I think the distinction was that the jar utility stored the files uncompressed. Java is fixed now, but I think it illustrates that zip files are non-trivial. BTW, I don't think the jar utility can delete files from a zip file either. ;-) Shane
Please don't propose a grand rewrite (even if it's only a single module). Given that the API is mostly sensible, please propose gradual refactoring of the implementation, perhaps some new API methods, and so on. Don't throw away the work that went into making it work in the first place!
Well, I didn't necessarily mean it should be thrown away and started from scratch
Well, you *did* say "rewrite". :-)
-- however, once you get all the ugly out of it, there's not much left! Obviously there's something wrong with the way it's written if it took years and *several passes* to correctly identify and fix a simple format character case bug. Most of this can be blamed on the struct module, which is more obscure and error-prone than writing the same code in C.
I think the reason is different -- it just hasn't had all that much use beyond the one use case for which it was written (zipping up the Python library). Also, don't underestimate the baroqueness of the zip spec.
One of the most useful things that could happen to the zipfile module would be a stream interface for both reading and writing. Right now it's slow and memory hungry when dealing with large chunks. The use case that led me to fix this bug is a tool that archives video to zip files of targa sequences with a reference QuickTime movie, so I end up with thousands of bite-sized chunks.
Sounds like a use case nobody else has tried yet.
This >2GB bug really caused me some grief in that I didn't test with such large sequences because I didn't have any. I didn't end up finding out about it until months later because the client *ignored* the exceptions raised by the GUI and came back to me with broken zip files. Fortunately the TOC in a zip file can be reconstructed from an otherwise pristine stream. Of course, I had to rewrite half of the zipfile module to come up with such a recovery program, because it's not designed well enough to let me build such a tool on top of it.
Given more typical use cases for zip files (sending around collections of source files) I'm not surprised that a bug that only occurs for files >2GB remained hidden for so long. I don't remember if you have Python CVS permissions, but you sound like you really know the module as well as the zip file spec, so I'm hoping that you'll find the time to do some reconstructive surgery on the zip module for Python 2.5, without breaking the existing APIs. I like the idea you have for a stream API; I recall that the one time I had to use it I was surprised that the API dealt with files as string buffers exclusively.
Another "bug" I ran into was that it has some crazy default for the ZipInfo record: it assumes the platform ("create_system") is Windows regardless of where you are!
I vaguely recall that the initial author was a Windows-head; perhaps he didn't realize how useful the module would be on other platforms, or that it would make any difference at all.
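For anyone hitting the same create_system issue, the field can be set explicitly on a ZipInfo before writing. A small sketch, assuming the host-system codes from the PKWARE appnote (0 for MS-DOS/FAT, 3 for Unix); `add_unix_entry` is a hypothetical helper, and the patch mentioned above may have worked differently.

```python
import zipfile

CREATE_SYSTEM_UNIX = 3  # "host system" code for Unix in the zip format


def add_unix_entry(zf, arcname, data, mode=0o644):
    """Write data under arcname, marking the entry as created on Unix."""
    info = zipfile.ZipInfo(arcname)
    info.create_system = CREATE_SYSTEM_UNIX
    # For Unix entries the permission bits live in the high 16 bits.
    info.external_attr = mode << 16
    zf.writestr(info, data)
    return info
```

Unzip tools that honour the host-system field then interpret external_attr as Unix permissions rather than DOS attributes.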
This caused some really subtle and annoying issues with some unzip tools (of course, on everyone's machines except mine). Fortunately someone was able to figure out why and send me a patch, but it was completely unexpected and I didn't see such craziness documented anywhere. If it weren't for this patch, it'd either still be broken, or I'd have switched to some other way of creating archives!
The zipfile module is good enough to create input files for zipimport, which is well tested and generally works -- barring the fact that zipimport has quite a few rough edges of its own. I certainly wouldn't recommend it for any heavy duty tasks in its current state.
So, please fix it! -- --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (5)
- Alan McIntyre
- Bob Ippolito
- Charles Hartman
- Guido van Rossum
- Shane Hathaway