[Python-Dev] zipfile and unicode filenames

"Martin v. Löwis" martin at v.loewis.de
Sun Jun 10 10:38:15 CEST 2007


> sys.setdefaultencoding()
> exists for a reason, wouldn't it be better if stdlib could cope with
> that at least with zipfile?

sys.setdefaultencoding just does not work. Many more things break when
you call it. It only exists because people like you insisted that it
exist.

> Also note that I'm trying to ask if zipfile should be improved, how it
> should be improved, and this possible improvement is not even for me
> (because now I know how zipfile behaves and I will work correctly with
> it, but someone else might stumble upon this very unexpectedly).

If you want to come up with a patch: sure. The zipfile module should
handle Unicode strings, encoding them in the encoding that the ZIP
specification defines (both the formal one, and the informal one
defined by PKWARE's implementation).
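
A minimal sketch of one possible write-side policy, not a patch for the
zipfile module: keep pure-ASCII names as plain bytes, and otherwise store
UTF-8 and set general purpose bit 11 (0x800), which the ZIP specification
reserves for marking UTF-8 names. The helper name encode_zip_filename and
the constant UTF8_FLAG are hypothetical, and the sketch is written for
Python 3.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8


def encode_zip_filename(filename, flag_bits=0):
    """Return (encoded_name, flag_bits) for a text file name.

    Hypothetical helper: ASCII names are stored unchanged, anything
    else is stored as UTF-8 with the UTF-8 flag bit set.
    """
    try:
        return filename.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


ascii_name, ascii_flags = encode_zip_filename("readme.txt")
utf8_name, utf8_flags = encode_zip_filename("L\u00f6wis.txt")
```

With this policy the header and the file name are both byte strings by
the time they are concatenated, so no implicit conversion can occur.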

The tricky question is what to do when reading ZIP files whose file
names contain non-ASCII characters (and yes, I understand that in your
case there were only ASCII characters in the file names).
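
One conceivable answer for the read side, sketched under assumptions:
trust general purpose bit 11 (0x800) when it is set and decode as UTF-8,
and otherwise fall back to cp437, the traditional PKZIP code page.
Archives written by other tools may use neither, so this is a guess at a
reasonable default rather than a complete solution. The helper name
decode_zip_filename is hypothetical; the code is Python 3.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8


def decode_zip_filename(raw_name, flag_bits):
    """Decode the raw file name bytes read from a ZIP header.

    Hypothetical helper: UTF-8 if the flag bit says so, else cp437.
    """
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    return raw_name.decode("cp437")


from_utf8 = decode_zip_filename("L\u00f6wis.txt".encode("utf-8"), UTF8_FLAG)
from_cp437 = decode_zip_filename("L\u00f6wis.txt".encode("cp437"), 0)
```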

> The problem was that sourcedir was unicode, and on my machine
> everything went ok multiple times. zipfile.ZipInfo.FileHeader would
> return unicode, but then when it writes it to a file it gets back to
> str (because mappings back and forth were identical). The problem
> happened when on a different machine header suddenly got byte 0x98 in
> position 10 (seems to be compress_size), which cp1251 codec couldn't
> decode. You see, arcname didn't even have unicode characters, but the
> mere fact that it was unicode made header upgrade to unicode in
> "return header + self.filename + self.extra".

Ok, now I understand. If filename is a Unicode string, concatenating it
with header forces header (a byte string) to be decoded implicitly using
the system default encoding; depending on the exact bytes in header and
on that encoding, this may raise a decoding error.
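
The failure can be illustrated with a small sketch. Python 3 no longer
performs this implicit conversion, so the decode step is written out
explicitly here. The packed header is a simplified stand-in (signature
plus a single 32-bit field), not the real local file header layout, and
the function name header_survives_coercion is hypothetical.

```python
import struct


def header_survives_coercion(header, system_encoding):
    """Return True if the header bytes decode under system_encoding.

    This mimics what Python 2 did implicitly when a byte string was
    concatenated with a Unicode string.
    """
    try:
        header.decode(system_encoding)
    except UnicodeDecodeError:
        return False
    return True


# Simplified stand-in for a header: signature plus one field whose
# low byte happens to be 0x98.
header = struct.pack("<4sL", b"PK\x03\x04", 0x98)

ok_latin1 = header_survives_coercion(header, "latin-1")  # all bytes map
ok_cp1251 = header_survives_coercion(header, "cp1251")   # 0x98 is unassigned
```

Whether the concatenation succeeds thus depends on binary header data
that has nothing to do with the file name itself, which is why the
problem shows up only on some machines and only for some files.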

This bug has been reported as http://bugs.python.org/1170311

> Because that's not supposed to work sanely when self.filename is
> unicode I'm asking if the right behavior would be to a) disallow
> unicode filenames in zipfile.ZipInfo, b) automatically convert
> filename to str in zipfile.ZipInfo, c) leave everything as it is.

The correct behavior would be b); the difficult details are what
encoding to use.

Regards,
Martin

