Copying zlib compression objects
I'm writing a program in Python that creates tar files of a certain maximum size (to fit onto CD/DVD). One of the problems I'm running into is that when using compression, it's pretty much impossible to determine whether a file, once added to an archive, will cause the archive size to exceed the maximum.

I believe that to do this properly, you need to copy the state of the tar file (basically the current file offset as well as the state of the compression object), then add the file. If the new size of the archive exceeds the maximum, you need to restore the original state.

The critical part is being able to copy the compression object. Without compression it is trivial to determine whether a given file will "fit" inside the archive. With compression, the compression ratio of a file depends partially on all the data that has been compressed before it.

The current implementation in the standard library does not allow you to copy these compression objects in a useful way, so I've made some minor modifications (patch attached) to the standard 2.4.2 library:

- Add a copy() method to the zlib compression object. This returns a new compression object with the same internal state. I named it copy() to keep it consistent with things like sha.copy().
- Add snapshot() / restore() methods to GzipFile and TarFile. These work only in write mode. snapshot() returns a state object; passing that state object to restore() restores the GzipFile / TarFile to the state it represents.

Future work:

- Decompression objects could use a copy() method too.
- Add support for copying bzip2 compression objects.

Although this patch isn't complete, does this seem like a good approach?

Cheers,
Chris
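The try-then-roll-back idea can be sketched with exactly the copy() method the patch proposes (zlib compression objects in later Python versions have this method). The helper below is illustrative, not part of the patch: try_add, written, and max_size are hypothetical names, and a sync flush is used so the trial output size can be measured.

```python
import zlib

def try_add(comp, written, data, max_size):
    """Tentatively compress `data`; commit only if the archive stays
    within `max_size` bytes.  `comp` is a zlib compression object and
    `written` counts compressed bytes already emitted."""
    trial = comp.copy()  # duplicate the compressor's internal state
    # Sync-flush the trial so all of its output is emitted and measurable;
    # on commit, the flushed trial becomes the live compressor.
    out = trial.compress(data) + trial.flush(zlib.Z_SYNC_FLUSH)
    if written + len(out) > max_size:
        return comp, written, None          # roll back: discard the trial
    return trial, written + len(out), out   # commit the trial's state

comp = zlib.compressobj()
data = b"hello world" * 50
comp, written, out = try_add(comp, 0, data, 10_000)        # fits; out holds bytes
comp, written, refused = try_add(comp, written,
                                 b"x" * 10**6, written + 10)  # too big; refused is None
```

On rollback the original compression object is untouched, since only the copy ever saw the new data. Before closing the archive, the stream still has to be finished with a final comp.flush().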
Please submit your patch to SourceForge.

On 2/17/06, Chris AtLee <chris@atlee.ca> wrote:
[original message quoted above in full]
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 2/17/06, Guido van Rossum <guido@python.org> wrote:
Please submit your patch to SourceForge.
I've submitted the zlib patch as patch #1435422. I added some test cases to test_zlib.py and documented the new methods.

I'd like to test my gzip / tarfile changes more before creating a patch for them, but I'm interested in any feedback on the idea of adding snapshot() / restore() methods to the GzipFile and TarFile classes.

It doesn't look like the underlying bz2 library supports copying compression / decompression streams, so for now it's impossible to make corresponding changes to the bz2 module.

I also noticed that tarfile reimplements the gzip file format when dealing with streams. Would it make sense to refactor some of the gzip.py code to expose the methods that read/write the gzip file header, and have the tarfile module use those methods?

Cheers,
Chris
participants (2): Chris AtLee, Guido van Rossum