Unicode and Zipfile problems
Gerson Kurz
gerson.kurz at t-online.de
Wed Nov 5 06:02:51 EST 2003
AAAAAAAARG I hate the way Python handles unicode. Here is a nice
problem for y'all to enjoy: say you have a variable that's unicode:
directory = u"c:\\temp"
It's unicode not because you want it to be, but because it was, for
example, read from _winreg, which returns unicode.
You do an os.listdir(directory). Note that all filenames returned are
now unicode. (A change introduced, I believe, in 2.3.)
You add the filenames to a zipfile.ZipFile object. Sometimes, you will
get this exception:
Traceback (most recent call last):
  File "collect_trace_info.py", line 65, in CollectTraceInfo
    z.write(pathname)
  File "C:\Python23\lib\zipfile.py", line 416, in write
    self.fp.write(zinfo.FileHeader())
  File "C:\Python23\lib\zipfile.py", line 170, in FileHeader
    return header + self.filename + self.extra
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 12: ordinal not in range(128)
After you have regained your composure, you find the reason: "header"
is a byte string generated by struct.pack(), whereas self.filename is
a unicode string because os.listdir returned it as unicode.
Concatenating the two makes Python decode "header" with the ascii
codec. If "header" contains any byte above 0x7F, which can but need
not happen depending on the type of file, you have an exception
waiting for you. Sometimes. Great. (The same will probably occur if
the filename contains chars > 0x7F.) The problem does not occur if you
have "str" type filenames, because then no back-and-forth conversion
takes place.
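The failure can be reproduced without zipfile at all. What follows is a minimal sketch, not the zipfile code itself: the header layout and the helper name concat_like_python2 are made up for illustration, and the ASCII decode that Python 2 performs implicitly on str + unicode is written out explicitly so the snippet also runs on later Python versions.

```python
import struct


def concat_like_python2(header, filename):
    """Mimic Python 2's `str + unicode`: decode the bytes as ASCII, then join."""
    return header.decode("ascii") + filename


# Hypothetical 6-byte headers: a 4-byte signature plus one 16-bit field.
ok_header = struct.pack("<4sH", b"PK\x03\x04", 20)     # every byte is below 0x80
bad_header = struct.pack("<4sH", b"PK\x03\x04", 0x88)  # contains the byte 0x88

# Works, because the header happens to be pure ASCII.
joined = concat_like_python2(ok_header, u"readme.txt")

# Fails, although the filename is exactly the same.
try:
    concat_like_python2(bad_header, u"readme.txt")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Whether the call succeeds depends only on the packed field values, which is why the exception shows up for some files and not for others.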
There is a simple fix: byte-encode the pathname before calling
z.write(). Here is some sample code that triggers the problem:
import os, zipfile, win32api

def test(directory):
    z = zipfile.ZipFile(os.path.join(directory, "temp.zip"), "w", zipfile.ZIP_DEFLATED)
    for filename in os.listdir(directory):
        z.write(os.path.join(directory, filename))
    z.close()

if __name__ == "__main__":
    test(unicode(win32api.GetSystemDirectory()))
Note: It might work on your system, depending on the types of files.
To fix it, use
z.write(os.path.join(directory, filename).encode("latin-1"))
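A slightly more defensive version of that encoding step, as a sketch only: encode_path is a hypothetical helper name, and latin-1 is simply the encoding used above. It leaves byte strings alone and encodes only text. The block assumes a Python version that has the bytes name, and it shows the encoding step by itself, not the zipfile call.

```python
def encode_path(path, encoding="latin-1"):
    """Return `path` as a byte string, encoding it only if it is text."""
    if isinstance(path, bytes):
        return path  # already bytes, nothing to convert
    return path.encode(encoding)  # strict: raises if a character does not fit


encoded = encode_path(u"c:\\temp\\caf\xe9.txt")

# A strict latin-1 encode still fails for characters above U+00FF.
try:
    encode_path(u"price_\u20ac.txt")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

So the one-liner cures the header problem, but it swaps it for a possible UnicodeEncodeError on filenames that latin-1 cannot represent.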
But to my thinking, this is a bug in zipfile.py, really.
Now, could anybody please just write an
"i-don't-care-if-my-app-can-display-klingon-characters" raw byte
encoding which doesn't throw any exceptions and doesn't care whether
or not the characters are in the 0x7F range? It's OK if I cannot port
my batch scripts to Swahili, really.
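For what it's worth, latin-1 already comes close to such a raw encoding in one direction: it maps every byte from 0x00 to 0xFF to the code point with the same number, so decoding with it cannot fail. Encoding can still fail for characters above U+00FF unless an error handler such as "replace" is passed. A sketch, with the hypothetical helper names raw_decode and raw_encode:

```python
def raw_decode(data):
    """Bytes to text, one code point per byte; never raises for byte input."""
    return data.decode("latin-1")


def raw_encode(text):
    """Text to bytes; characters above U+00FF become '?' instead of raising."""
    return text.encode("latin-1", "replace")


all_bytes = bytes(bytearray(range(256)))

# Every possible byte value survives a decode/encode round trip unchanged.
round_trip = raw_encode(raw_decode(all_bytes))

# Anything latin-1 cannot represent is replaced, not rejected.
lossy = raw_encode(u"klingon_\u20ac")
```

The price is the one accepted above: characters outside the 0x00 to 0xFF range are lost on the way to bytes.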