Copying zlib compression objects

chris.atlee at gmail.com
Tue Feb 14 17:19:28 EST 2006


I'm writing a program in Python that creates tar files of a certain
maximum size (to fit onto CD/DVD).  One of the problems I'm running
into is that, when using compression, there is no reliable way to tell
in advance whether adding a file to an archive will push the archive
past the maximum size.

I believe that to do this properly, you need to save the state of the
tar file (basically the current file offset as well as the state of the
compression object), then add the file.  If the new size of the archive
exceeds the maximum, you restore the saved state.

The critical part is being able to copy the compression object.
Without compression it is trivial to determine if a given file will
"fit" inside the archive.  When using compression, the compression
ratio of a file depends partially on all the data that has been
compressed prior to it.
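
To illustrate the dependence on earlier data, here is a minimal sketch.
It assumes the compression object has a copy() method (the one proposed
in this patch; current CPython's zlib provides Compress.copy()).  The
helper name trial_size() is made up for this example.

```python
import os
import zlib


def trial_size(compressor, data):
    """Return how many compressed bytes `data` would add right now,
    without changing the state of `compressor`."""
    probe = compressor.copy()  # independent copy of the deflate state
    out = probe.compress(data)
    # Sync-flush the copy so every pending byte is counted.
    out += probe.flush(zlib.Z_SYNC_FLUSH)
    return len(out)


data = os.urandom(4096)  # incompressible on its own

fresh = zlib.compressobj()

primed = zlib.compressobj()
primed.compress(data)              # this stream has already seen `data`
primed.flush(zlib.Z_SYNC_FLUSH)    # drain pending output before probing

cost_fresh = trial_size(fresh, data)    # about len(data)
cost_primed = trial_size(primed, data)  # much smaller: matches earlier input
```

The same bytes cost far less in the stream that has already seen them,
which is why the size of a file inside the archive can't be computed in
isolation.  Because the probe works on a copy, the original compressor
is left untouched.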

The current implementation in the standard library does not allow you
to copy these compression objects in a useful way, so I've made some
minor modifications (patch follows) to the standard 2.4.2 library:
- Add copy() method to zlib compression object.  This returns a new
compression object with the same internal state.  I named it copy() to
keep it consistent with things like sha.copy().
- Add snapshot() / restore() methods to GzipFile and TarFile.  These
work only in write mode.  snapshot() returns a state object.  Passing
in this state object to restore() will restore the state of the
GzipFile / TarFile to the state represented by the object.
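
The intended usage pattern might look something like the sketch below.
SnapshotWriter and add_if_fits() are hypothetical stand-ins written for
this example, not part of the patch or the standard library; the writer
is a bare zlib stream over a seekable file object rather than a real
GzipFile, and it relies on the compression object having copy().

```python
import io
import os
import zlib


class SnapshotWriter:
    """Hypothetical stand-in for a write-mode GzipFile: a zlib stream
    written to a seekable file object, with snapshot() / restore()."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.compress = zlib.compressobj()

    def write(self, data):
        self.fileobj.write(self.compress.compress(data))
        # Sync-flush so that tell() reflects everything written so far.
        self.fileobj.write(self.compress.flush(zlib.Z_SYNC_FLUSH))

    def snapshot(self):
        # File offset plus a copy of the compressor state.
        return self.fileobj.tell(), self.compress.copy()

    def restore(self, state):
        offset, compress = state
        # Copy again so the same snapshot can be restored more than once.
        self.compress = compress.copy()
        self.fileobj.seek(offset)
        self.fileobj.truncate()

    def close(self):
        self.fileobj.write(self.compress.flush())


def add_if_fits(writer, data, max_size):
    """Write data, rolling back if the output grows past max_size."""
    state = writer.snapshot()
    writer.write(data)
    if writer.fileobj.tell() > max_size:
        writer.restore(state)
        return False
    return True


buf = io.BytesIO()
w = SnapshotWriter(buf)
small = b"a" * 1000
big = os.urandom(5000)

ok_small = add_if_fits(w, small, 2000)   # fits
ok_big = add_if_fits(w, big, 2000)       # too large, rolled back
ok_again = add_if_fits(w, small, 2000)   # stream is still usable
w.close()
```

After the rollback the output still decompresses to just the two small
writes.  The sync flush in write() is a choice made for this sketch so
that tell() is accurate; it costs a few bytes per write, and the limit
check does not account for the trailer written by close().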

Future work:
- Decompression objects could use a copy() method too
- Add support for copying bzip2 compression objects

Does this seem like a good approach?

Cheers,
Chris

diff -ur Python-2.4.2.orig/Lib/gzip.py Python-2.4.2/Lib/gzip.py
--- Python-2.4.2.orig/Lib/gzip.py	2005-06-09 10:22:07.000000000 -0400
+++ Python-2.4.2/Lib/gzip.py	2006-02-14 13:12:29.000000000 -0500
@@ -433,6 +433,17 @@
         else:
             raise StopIteration

+    def snapshot(self):
+        if self.mode == READ:
+            raise IOError("Can't create a snapshot in READ mode")
+        return (self.size, self.crc, self.fileobj.tell(), self.offset, self.compress.copy())
+
+    def restore(self, s):
+        if self.mode == READ:
+            raise IOError("Can't restore a snapshot in READ mode")
+        self.size, self.crc, offset, self.offset, self.compress = s
+        self.fileobj.seek(offset)
+        self.fileobj.truncate()

 def _test():
     # Act like gzip; with -d, act like gunzip.
diff -ur Python-2.4.2.orig/Lib/tarfile.py Python-2.4.2/Lib/tarfile.py
--- Python-2.4.2.orig/Lib/tarfile.py	2005-08-27 06:08:21.000000000 -0400
+++ Python-2.4.2/Lib/tarfile.py	2006-02-14 16:50:41.000000000 -0500
@@ -1825,6 +1825,28 @@
         """
         if level <= self.debug:
             print >> sys.stderr, msg
+
+    def snapshot(self):
+        """Save the current state of the tarfile
+        """
+        self._check("_aw")
+        if hasattr(self.fileobj, "snapshot"):
+            return self.fileobj.snapshot(), self.offset, self.members[:]
+        else:
+            return self.fileobj.tell(), self.offset, self.members[:]
+
+    def restore(self, s):
+        """Restore the state of the tarfile from a previous snapshot
+        """
+        self._check("_aw")
+        if hasattr(self.fileobj, "restore"):
+            snapshot, self.offset, self.members = s
+            self.fileobj.restore(snapshot)
+        else:
+            offset, self.offset, self.members = s
+            self.fileobj.seek(offset)
+            self.fileobj.truncate()
+
 # class TarFile

 class TarIter:
diff -ur Python-2.4.2.orig/Modules/zlibmodule.c Python-2.4.2/Modules/zlibmodule.c
--- Python-2.4.2.orig/Modules/zlibmodule.c	2004-12-28 15:12:31.000000000 -0500
+++ Python-2.4.2/Modules/zlibmodule.c	2006-02-14 14:05:35.000000000 -0500
@@ -653,6 +653,46 @@
     return RetVal;
 }

+PyDoc_STRVAR(comp_copy__doc__,
+"copy() -- Return a copy of the compression object.");
+
+static PyObject *
+PyZlib_copy(compobject *self, PyObject *args)
+{
+    compobject *retval;
+    int err;
+
+    retval = newcompobject(&Comptype);
+    if (retval == NULL)
+        return NULL;
+
+    /* Copy the zstream state */
+    /* TODO: Are the ENTER / LEAVE needed? */
+    ENTER_ZLIB
+    err = deflateCopy(&retval->zst, &self->zst);
+    LEAVE_ZLIB
+    if (err != Z_OK) {
+        zlib_error(self->zst, err, "while copying compression object");
+        Py_DECREF(retval);
+        return NULL;
+    }
+
+    /* Share the original unused_data and unconsumed_tail.  They're not
+     * used by compression objects, but newcompobject() already gave
+     * retval its own empty strings, so release those first. */
+    Py_INCREF(self->unused_data);
+    Py_INCREF(self->unconsumed_tail);
+    Py_XDECREF(retval->unused_data);
+    Py_XDECREF(retval->unconsumed_tail);
+    retval->unused_data = self->unused_data;
+    retval->unconsumed_tail = self->unconsumed_tail;
+
+    /* Mark it as being initialized */
+    retval->is_initialised = 1;
+
+    return (PyObject*)retval;
+}
+
 PyDoc_STRVAR(decomp_flush__doc__,
 "flush() -- Return a string containing any remaining decompressed data.\n"
 "\n"
@@ -723,6 +763,8 @@
                  comp_compress__doc__},
     {"flush", (binaryfunc)PyZlib_flush, METH_VARARGS,
               comp_flush__doc__},
+    {"copy", (binaryfunc)PyZlib_copy, METH_VARARGS,
+              comp_copy__doc__},
     {NULL, NULL}
 };



