[Python-Dev] zipfile and unicode filenames
"Martin v. Löwis"
martin at v.loewis.de
Sun Jun 10 18:45:51 CEST 2007
> I don't think always encoding them to utf-8 (and using bit 11 of
> flag_bits) is a good idea, since there's a chance to create archives
> that won't be correctly readable by programs not supporting this bit
> (it's no secret that currently some programs just assume that
> filenames are encoded using one of system encodings).
I think it is also fairly uniformly agreed that these programs are
incorrect; the official encoding of file names in a zip file is
Windows/DOS code page 437.
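
To illustrate why assuming a system encoding goes wrong, here is a minimal sketch in Python 3 syntax. The file name and the choice of Windows-1252 as the "system encoding" are only assumptions for the example; any non-ASCII name that is representable in CP437 shows the same effect.

```python
# Hypothetical example name: "Löwis.txt", written with an escape so the
# snippet does not depend on the source file encoding.
name = "L\u00f6wis.txt"

# Bytes as they would be stored in the archive under the CP437 convention.
stored = name.encode("cp437")

# A reader that follows the convention recovers the original name.
as_cp437 = stored.decode("cp437")

# A reader that assumes a system encoding (Windows-1252 is assumed here)
# silently produces a different name from the same bytes.
as_cp1252 = stored.decode("cp1252")

print(ascii(stored))
print(ascii(as_cp437))
print(ascii(as_cp1252))
```

Both decodes succeed without an error, which is why the mismatch tends to go unnoticed until the extracted names look wrong.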
> This is too
> complex and hazy to implement. Even if I know what is the situation on
> Windows (i.e. using OEM, also called DOS encoding, but I'm not sure
> how to determine its codec name from within python apart from calling
> GetConsoleCP), I'm totally unaware of the situation on other operating
> systems.
I don't think that the situation on Windows is that the OEM code page
should be used. Instead, CP 437 should be used, independent of the OEM
code page.
>> The tricky question is what to do when reading in zipfiles with
>> non-ASCII characters (and yes, I understand that in your case
>> there were only ASCII characters in the file names).
> I don't think it should be changed.
In Python 3, it will certainly change, since the string type
will be unicode-based. It probably should not change for the
rest of 2.x.
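
For readers checking this against a present-day Python 3, the following sketch writes two entries to an in-memory archive and reports whether bit 11 is set for each. The entry names and the UTF8_FLAG constant are made up for the example; it only shows what the zipfile module in your installation does, not how it must behave.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag (name is ours)

# Write one ASCII name and one non-ASCII name into an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr("L\u00f6wis.txt", b"b")

# Read the archive back and record, per entry, whether bit 11 is set.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}

for entry_name, uses_utf8 in flags.items():
    print(ascii(entry_name), uses_utf8)
```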
> Current zipfile seems to officially support ascii filenames only
That's not true. You can use any byte string as the file name
that you want, including non-ASCII strings encoded in CP437.
> + filename = str(self.filename)
That would be incorrect, as it relies on the system encoding,
which shouldn't be relied upon. Plus, it would allow arbitrary
non-string things as filenames. What it should do instead
(IMO) is to encode in CP437. Bonus points if it falls back
to the UTF-8 feature of zip files if encoding as CP437 fails.
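
A minimal sketch of that suggestion, in Python 3 syntax, might look like the following. The function name encode_zip_filename and the constant UTF8_FLAG are hypothetical and not part of the zipfile module; this is an illustration of the idea, not a patch.

```python
UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag (name is ours)


def encode_zip_filename(filename, flag_bits=0):
    """Return (encoded_name, flag_bits) for a text file name.

    Prefer CP437; fall back to UTF-8 and set bit 11 only when the
    name cannot be represented in CP437.
    """
    # Reject arbitrary non-string objects instead of calling str() on them.
    if not isinstance(filename, str):
        raise TypeError("file name must be a text string")
    try:
        # CP437 worked, so make sure the UTF-8 flag is not set.
        return filename.encode("cp437"), flag_bits & ~UTF8_FLAG
    except UnicodeEncodeError:
        # Not representable in CP437: use UTF-8 and announce it via bit 11.
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


print(encode_zip_filename("readme.txt"))
print(encode_zip_filename("L\u00f6wis.txt"))
print(encode_zip_filename("\u65e5\u672c.txt"))
```

Archives whose names fit in CP437 stay readable by tools that ignore bit 11, and the flag is only used when there is no other way to store the name.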