[Python-Dev] zipfile and unicode filenames
"Martin v. Löwis"
martin at v.loewis.de
Sun Jun 10 10:38:15 CEST 2007
> exists for a reason, wouldn't it be better if stdlib could cope with
> that at least with zipfile?
sys.setdefaultencoding just does not work. Many more things break when
you call it. It only exists because people like you insisted that it
> Also note that I'm trying to ask if zipfile should be improved, how it
> should be improved, and this possible improvement is not even for me
> (because now I know how zipfile behaves and I will work correctly with
> it, but someone else might stumble upon this very unexpectedly).
If you want to come up with a patch: sure. The zipfile module should
handle Unicode strings, encoding them in the encoding that the ZIP
specification defines (both the formal one, and the
The tricky question is what to do when reading in zipfiles with
non-ASCII characters (and yes, I understand that in your case
there were only ASCII characters in the file names).
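
To make the reading side concrete, here is a minimal sketch, not what zipfile does today. It assumes the archive follows the convention that general purpose flag bit 11 marks a UTF-8 name and that names are cp437 otherwise. The helper decode_filename and the constant UTF8_FLAG are hypothetical names used only for illustration. Archives written with some other code page would still be decoded wrongly under this assumption, which is exactly the tricky part.

```python
# Sketch only: decode a stored name under an assumed convention.
UTF8_FLAG = 0x800  # general purpose flag bit 11 (assumed to mean "name is UTF-8")


def decode_filename(raw_name, flag_bits):
    """Return a text filename from the raw bytes stored in the archive."""
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    # Assumption: no flag means cp437. Real archives may use other
    # code pages, and nothing in the archive says which one.
    return raw_name.decode("cp437")
```

For a pure-ASCII name both branches give the same result, so the ambiguity only shows up for non-ASCII names.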
> The problem was that sourcedir was unicode, and on my machine
> everything went ok multiple times. zipfile.ZipInfo.FileHeader would
> return unicode, but then when it writes it to a file it gets back to
> str (because mappings back and forth were identical). The problem
> happened when on a different machine header suddenly got byte 0x98 in
> position 10 (seems to be compress_size), which cp1251 codec couldn't
> decode. You see, arcname didn't even have unicode characters, but the
> mere fact that it was unicode made header upgrade to unicode in
> "return header + self.filename + self.extra".
Ok, now I understand. If filename is a Unicode string, header is
converted using the system encoding; depending on the exact value
of header and depending on the system encoding, this may cause
a decoding error.
This bug has been reported as http://bugs.python.org/1170311
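
The failure can be reproduced in miniature. The sketch below is written for Python 3, where bytes and text never mix implicitly, so the helper py2_concat (a hypothetical name) spells out the decode step that Python 2 performs silently when a str is added to a unicode object. The header here is a made-up stand-in with one 32-bit field, not the real zipfile header layout.

```python
import struct

# Hypothetical stand-in for the packed header: signature plus one 32-bit
# field whose value happens to contain the byte 0x98.
header = struct.pack("<4sL", b"PK\x03\x04", 0x98)


def py2_concat(header_bytes, filename, default_encoding):
    """Emulate Python 2's str + unicode: the byte string is decoded first."""
    return header_bytes.decode(default_encoding) + filename


failed_byte = None
try:
    py2_concat(header, "a.txt", "cp1251")
except UnicodeDecodeError as exc:
    # cp1251 has no character assigned to 0x98, so decoding fails even
    # though the filename itself is plain ASCII.
    failed_byte = header[exc.start]
```

Whether the error appears depends only on the bytes in the header and on the default encoding, which is why the same code can work on one machine and fail on another.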
> Because that's not supposed to work sanely when self.filename is
> unicode I'm asking if the right behavior would be to a) disallow
> unicode filenames in zipfile.ZipInfo, b) automatically convert
> filename to str in zipfile.ZipInfo, c) leave everything as it is.
The correct behavior would be b); the difficult detail is which
encoding to use.
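
As one possible shape for b), here is a sketch under stated assumptions, not a proposed patch. It assumes that ASCII is tried first and that anything else is stored as UTF-8 with general purpose flag bit 11 set. The names encode_filename and UTF8_FLAG are hypothetical. Other encoding choices are possible, and that choice is the open question.

```python
# Sketch only: convert a text filename to bytes before it meets the header.
UTF8_FLAG = 0x800  # general purpose flag bit 11 (assumed to mean "name is UTF-8")


def encode_filename(filename, flag_bits=0):
    """Return (name_bytes, flag_bits) ready to be written after the header."""
    if isinstance(filename, bytes):
        # Already bytes: leave untouched.
        return filename, flag_bits
    try:
        # Pure-ASCII names need no flag and are readable by any tool.
        return filename.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        # Assumption: fall back to UTF-8 and mark it in the flags.
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG
```

With this, the header and the name are both byte strings when they are concatenated, so no implicit decoding of the header can happen.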