[Python-Dev] zipfile and unicode filenames

Alexey Borzenkov snaury at gmail.com
Sun Jun 10 20:17:16 CEST 2007


On 6/10/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > I don't think always encoding them to utf-8 (and using bit 11 of
> > flag_bits) is a good idea, since there's a chance to create archives
> > that won't be correctly readable by programs not supporting this bit
> > (it's no secret that currently some programs just assume that
> > filenames are encoded using one of system encodings).
> I think it is also fairly uniformly agreed that these programs are
> incorrect; the official encoding of file names in a zip file is
> Windows/DOS code page 437.

Before replying to you I actually did some quick tests. I packed a
file with a localized filename using each tool, then opened the archive
in Explorer and also inspected it in a hex editor:

   7-Zip: directory cp866, header cp866: explorer sees correct filename.
   zipfile: directory cp1251, header cp1251: explorer sees incorrect filename.
   pkzip25.exe: directory cp866, header cp1251: explorer sees correct filenames, zipfile complains that filenames differ.
   zip.exe: directory cp1251, header cp1251: explorer sees incorrect filenames.

Also note that modifying the filename in the directory with a hex
editor to cp866 made explorer see correct filenames. Another experiment
with pkzip25 showed that modifying the filename in the directory makes
it extract files with that filename, i.e. it ignores the header
filename. The same behavior is shown by 7-Zip.
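
For anyone who wants to repeat the comparison without a hex editor,
here is a minimal sketch of reading both copies of a name: the one in
the central directory (as reported by zipfile) and the raw one in the
local file header. The archive and the name "hello.txt" are made up for
illustration; the 30-byte fixed header layout is the standard local
file header.

```python
import io
import struct
import zipfile

# Build a small archive in memory (hypothetical content, for illustration).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"data")
raw = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.infolist()[0]

# Name as stored in the central directory, decoded by zipfile.
directory_name = info.filename

# The local file header has 30 fixed bytes: signature, five 16-bit
# fields, three 32-bit fields, then name length and extra length.
start = info.header_offset
fields = struct.unpack("<4s5H3L2H", raw[start:start + 30])
signature, name_len = fields[0], fields[9]

# Raw, undecoded name bytes from the local header.
header_name = raw[start + 30:start + 30 + name_len]

print(directory_name, header_name)
```

For an ASCII name both copies agree; the interesting case is an archive
produced by another tool, where header_name can be decoded with
different candidate code pages and compared.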

So the general idea is that at least the directory filename follows
some sort of convention of using the OEM (DOS, console) encoding on
Windows, cp866 in my case. Header filenames come in different
encodings, and seem to be ignored.
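
To make the difference concrete, here is a small sketch of how one name
turns into different bytes under the two code pages seen in the tests.
The filename is hypothetical; the tests above do not say which name was
actually used.

```python
# Hypothetical localized filename, chosen only for illustration.
name = "тест.txt"

cp866_bytes = name.encode("cp866")    # OEM (DOS, console) code page
cp1251_bytes = name.encode("cp1251")  # ANSI (Windows) code page

# A reader that assumes cp866 but is handed cp1251 bytes shows garbage.
garbled = cp1251_bytes.decode("cp866")

# CP 437 has no Cyrillic letters, so this name cannot be stored in it.
try:
    name.encode("cp437")
    fits_cp437 = True
except UnicodeEncodeError:
    fits_cp437 = False

print(cp866_bytes, cp1251_bytes, garbled, fits_cp437)
```

This matches the pattern above: the byte sequences differ, so a reader
that expects the OEM code page only shows the right name when the
directory entry was written in that code page.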

> I don't think that the situation on Windows is that the OEM code page
> should be used. Instead, CP 437 should be used, independent of the OEM
> code page.

On the contrary, pkzip25, made by PKWARE Inc. themselves, behaves otherwise.

> > +        filename = str(self.filename)
> That would be incorrect, as it relies on the system encoding,
> which shouldn't be relied upon.

Well, as I've seen in the examples above, the system (or actually
DOS) encoding is what is used by at least three major programs:
7-Zip, pkzip25 and explorer, at least on Windows.

> Plus, it would allow arbitrary
> non-string things as filenames.

Hmm... why is that bad?

> What it should do instead
> (IMO) is to encode in CP437. Bonus points if it falls back
> to the UTF-8 feature of zip files if encoding as CP437 fails.

And encoding to cp437 would be incorrect, as no currently existing
program would read such archives correctly on non-English Windows
systems. I think that letting the user decide on the encoding is the
right way to go here, as you can't know what the user actually wants
these days; it's all too hazy to me. And in case unicode is passed, it
just converts it using the ascii (or default) codec. One can specify
the ascii codec there explicitly, if using the system encoding is
really an issue.
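
A rough sketch of how the two positions could be combined: the caller
picks the encoding, and UTF-8 with bit 11 of flag_bits is used only when
the name does not fit. The helper encode_zip_filename and its parameters
are hypothetical and not part of zipfile or of the patch under
discussion.

```python
UTF8_FLAG = 0x800  # bit 11 of the general purpose flag bits


def encode_zip_filename(name, encoding="cp437", flag_bits=0):
    """Hypothetical helper: return (raw_bytes, flag_bits) for an entry name.

    The caller chooses the encoding; UTF-8 plus bit 11 is only a
    fallback for names that the chosen encoding cannot represent.
    """
    try:
        return name.encode(encoding), flag_bits
    except UnicodeEncodeError:
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


# ASCII names are stored unchanged under any of these code pages.
print(encode_zip_filename("readme.txt"))
# A Cyrillic name fits the user-chosen cp866, so bit 11 stays clear.
print(encode_zip_filename("тест.txt", encoding="cp866"))
# The same name does not fit cp437, so the fallback sets bit 11.
print(encode_zip_filename("тест.txt"))
```

Whether the default should be cp437, the OEM code page, or something
else is exactly the open question here; the sketch only shows that the
choice can be left to the caller.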

