[Python-Dev] zipfile and unicode filenames

Alexey Borzenkov snaury at gmail.com
Sun Jun 10 22:26:33 CEST 2007


On 6/10/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > So the general idea is that at least directory filename has some sort
> > of convention of using oem (dos, console) encoding on Windows, cp866
> > in my case. Header filenames have different encodings, and seem to be
> > ignored.
> Ok, then this is what the zipfile module should implement.

But this is only on Windows! I have no clue what the common
situation is on other OSes, and I don't even know how to sanely get
the OEM codepage on Windows (the obvious way, with
ctypes.windll.kernel32.GetOEMCP(), doesn't seem good to me).
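
For reference, a minimal sketch of that ctypes lookup, written for a
current Python 3 rather than the 2.x tree the patch targets. The helper
name oem_codec_name is made up for illustration, and mapping the numeric
code page to a "cpNNN" codec name is an assumption that holds for common
OEM pages such as 866 but is not guaranteed for every value.

```python
import ctypes
import sys


def oem_codec_name():
    """Return a codec name such as 'cp866' for the Windows OEM code page.

    Hypothetical helper: returns None on non-Windows platforms, where
    there is no OEM code page to ask for.
    """
    if sys.platform != "win32":
        return None
    # GetOEMCP() is a Win32 API call that returns the numeric OEM code page.
    return "cp%d" % ctypes.windll.kernel32.GetOEMCP()


if __name__ == "__main__":
    codec = oem_codec_name()
    if codec is None:
        print("not on Windows, no OEM code page available")
    else:
        print("OEM code page codec:", codec)
```

This only answers the Windows half of the question; it says nothing
about what archives created on other systems contain.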

So I guess that's a bad idea anyway; maybe conforming to the language
encoding bit (bit 11 of the flags, which marks the filename as UTF-8)
is better (ASCII will stay ASCII anyway).

What about this?

Index: Lib/zipfile.py
===================================================================
--- Lib/zipfile.py	(revision 55850)
+++ Lib/zipfile.py	(working copy)
@@ -252,6 +252,7 @@
             self.extract_version = max(45, self.extract_version)
             self.create_version = max(45, self.extract_version)

+        self._encodeFilename()
         header = struct.pack(structFileHeader, stringFileHeader,
                  self.extract_version, self.reserved, self.flag_bits,
                  self.compress_type, dostime, dosdate, CRC,
@@ -259,6 +260,16 @@
                  len(self.filename), len(extra))
         return header + self.filename + extra

+    def _encodeFilename(self):
+        if isinstance(self.filename, unicode):
+            self.filename = self.filename.encode('utf-8')
+            self.flag_bits = self.flag_bits | 0x800
+
+    def _decodeFilename(self):
+        if self.flag_bits & 0x800:
+            self.filename = self.filename.decode('utf-8')
+            self.flag_bits = self.flag_bits & ~0x800
+
     def _decodeExtra(self):
         # Try to decode the extra field.
         extra = self.extra
@@ -683,6 +694,7 @@
                                      t>>11, (t>>5)&0x3F, (t&0x1F) * 2 )

             x._decodeExtra()
+            x._decodeFilename()
             x.header_offset = x.header_offset + concat
             self.filelist.append(x)
             self.NameToInfo[x.filename] = x
@@ -967,6 +979,7 @@
                     extract_version = zinfo.extract_version
                     create_version = zinfo.create_version

+                zinfo._encodeFilename()
                 centdir = struct.pack(structCentralDir,
                   stringCentralDir, create_version,
                   zinfo.create_system, extract_version, zinfo.reserved,
Index: Lib/test/test_zipfile.py
===================================================================
--- Lib/test/test_zipfile.py	(revision 55850)
+++ Lib/test/test_zipfile.py	(working copy)
@@ -515,6 +515,11 @@
         # and report that the first file in the archive was corrupt.
         self.assertRaises(RuntimeError, zipf.testzip)

+    def testUnicodeFilenames(self):
+        zf = zipfile.ZipFile(TESTFN, "w")
+        zf.writestr(u"foo.txt", "Test for unicode filename")
+        zf.close()
+
     def tearDown(self):
         support.unlink(TESTFN)
         support.unlink(TESTFN2)

The problem is that I don't know if anything actually supports bit 11
at this time, so I can't even tell whether I did this correctly or not. :(
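
One way to at least check that the bit ends up in the archive is to
write a member and read the flags back. The sketch below is an
illustration only: it runs against the zipfile module of a current
Python 3, not the patched revision above, and the helper name
name_flags is hypothetical. It reads the flags twice, once from the raw
local file header and once through ZipInfo.flag_bits.

```python
import io
import struct
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def name_flags(filename):
    """Write one member to an in-memory archive and report its name flags.

    Returns (filename as read back, bit 11 set in the local header,
    bit 11 set in ZipInfo.flag_bits).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, "Test for unicode filename")
    raw = buf.getvalue()

    # In the local file header, the flags are the little-endian 16-bit
    # field at offset 6 (after the 4-byte signature and 2 version bytes).
    (local_flags,) = struct.unpack("<H", raw[6:8])

    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.infolist()[0]

    return (info.filename,
            bool(local_flags & UTF8_FLAG),
            bool(info.flag_bits & UTF8_FLAG))


if __name__ == "__main__":
    for name in ("foo.txt", "файл.txt"):
        print(name_flags(name))
```

This does not show whether other zip tools honour the bit; it only
confirms what the writer put into the headers.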

