Unicode and Zipfile problems

Gerson Kurz gerson.kurz at t-online.de
Wed Nov 5 12:02:51 CET 2003

AAAAAAAARG I hate the way Python handles unicode. Here is a nice
problem for y'all to enjoy: say you have a variable that's unicode

directory = u"c:\\temp"

It's unicode not because you want it to be, but because it was, for
example, read from _winreg, which returns unicode.

You do an os.listdir(directory). Note that all filenames returned are
now unicode. (A change introduced, I believe, in 2.3.)
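
A minimal sketch of that "type in, type out" behaviour of os.listdir. It assumes a current Python 3 interpreter, where the two string types are called str and bytes rather than unicode and str; the helper name listing_types and the temporary directory are made up for illustration.

```python
import os
import tempfile

def listing_types(directory):
    """Return the set of type names that os.listdir yields for `directory`."""
    return set(type(name).__name__ for name in os.listdir(directory))

# Create a throwaway directory containing a single file.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "a.txt"), "w").close()

# A text path gives text names, a byte path gives byte names.
text_types = listing_types(tmp)
byte_types = listing_types(os.fsencode(tmp))
print(text_types, byte_types)
```

So a path that arrives as unicode (from _winreg, say) quietly turns every filename derived from it into unicode as well.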

You add the filenames to a zipfile.ZipFile object. Sometimes, you will
get this exception:

Traceback (most recent call last):
  File "collect_trace_info.py", line 65, in CollectTraceInfo
  File "C:\Python23\lib\zipfile.py", line 416, in write
  File "C:\Python23\lib\zipfile.py", line 170, in FileHeader
    return header + self.filename + self.extra
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position
ordinal not in range(128)

After you have regained your composure, you find the reason: "header"
is a byte string generated by struct.pack(), while self.filename is a
unicode string, because os.listdir returned it as unicode.
Concatenating the two makes Python decode "header" as ASCII first. If
"header" contains any byte above 0x7F - which can but need not happen,
depending on the type of file - you have an exception waiting for
you - sometimes. Great. (The same will probably occur if the filename
contains chars > 0x7F.) The problem does not occur if you have "str"
type filenames, because then no back-and-forth conversion is being
made.
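
The failure can be sketched without zipfile at all. The snippet below is an illustration, not the code from zipfile.py: the helper concat_like_python2 is hypothetical and spells out the ASCII decode that Python 2 performs implicitly for `str + unicode`, so the sketch also runs on Python 3. The struct formats and values are made up to produce one header without and one with a byte above 0x7F.

```python
import struct

def concat_like_python2(header, filename):
    """Mimic Python 2's `str + unicode`: decode the bytes as ASCII, then join."""
    return header.decode("ascii") + filename

# A header whose packed fields all stay below 0x80 concatenates fine.
small = struct.pack("<4sH", b"PK\x03\x04", 20)
print(repr(concat_like_python2(small, u"a.txt")))

# A 32-bit field (a CRC or a size, say) can easily contain a byte >= 0x80.
big = struct.pack("<4sL", b"PK\x03\x04", 0x88776655)
try:
    concat_like_python2(big, u"a.txt")
except UnicodeDecodeError as exc:
    print("fails:", exc)
```

Whether the error shows up depends only on the packed values, which is why the same script can work for one file and blow up for the next.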

There is a simple fix: byte-encode the filename before calling
z.write(). Here is some sample code that shows the problem:

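import os, zipfile, win32api  # win32api is unused in what survives of the sample

def test(directory):
    # The original archive name was lost; "test.zip" is a placeholder.
    z = zipfile.ZipFile("test.zip", "w")
    for filename in os.listdir(directory):
        # The joined path is unicode whenever directory is unicode.
        z.write(os.path.join(directory, filename))
    z.close()

if __name__ == "__main__":
    # The original call was lost; any unicode directory will do.
    test(u"c:\\temp")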

Note: It might work on your system, depending on the types of files.
To fix it, use

z.write(os.path.join(directory, filename).encode("latin-1"))
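
If the same script has to cope with both kinds of filenames, the encoding step could be wrapped in a small helper. This is only a sketch: as_byte_name is a hypothetical name, and the isinstance check against bytes assumes an interpreter newer than 2.3 (where the byte type is simply str). Latin-1 is the encoding used above; it raises for characters beyond U+00FF.

```python
def as_byte_name(path, encoding="latin-1"):
    """Return `path` as a byte string, encoding it only if it is unicode text."""
    if isinstance(path, bytes):
        # Already a byte string: nothing to convert.
        return path
    return path.encode(encoding)

print(repr(as_byte_name(u"c:\\temp\\caf\xe9.txt")))
print(repr(as_byte_name(b"c:\\temp\\plain.txt")))
```

The call in the loop would then read z.write(as_byte_name(os.path.join(directory, filename))).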

But to my thinking, this is a bug in zipfile.py, really. 

Now, could anybody please just write an
"i-don't-care-if-my-app-can-display-klingon-characters" raw byte
encoding which doesn't throw any exceptions and doesn't care whether
or not the characters are in the 0x7F range? It's OK if I cannot port
my batch scripts to Swahili, really.
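
For what it's worth, the existing Latin-1 codec comes close to that wish. A rough sketch, with the hypothetical helper names to_raw_bytes and from_raw_bytes: decoding cannot fail because every byte value 0x00-0xFF maps to a code point, and encoding with the "replace" error handler swaps anything above U+00FF for "?" instead of raising. The price is that such characters are lost.

```python
def to_raw_bytes(text):
    """Encode text as Latin-1, substituting '?' for anything above U+00FF."""
    return text.encode("latin-1", "replace")

def from_raw_bytes(data):
    """Decode bytes as Latin-1; every byte value 0x00-0xFF has a mapping."""
    return data.decode("latin-1")

print(repr(to_raw_bytes(u"caf\xe9")))    # fits in Latin-1, kept as is
print(repr(to_raw_bytes(u"\u4e2d")))     # does not fit, becomes a question mark
print(repr(from_raw_bytes(b"\x88\xff")))
```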
