Unicode and Zipfile problems

vincent wehren vincent at visualtrans.de
Wed Nov 5 18:20:50 CET 2003

"Gerson Kurz" <gerson.kurz at t-online.de> wrote in message
news:3fa8d5ee.304218 at news.t-online.de...
| AAAAAAAARG I hate the way python handles unicode. Here is a nice
| problem for y'all to enjoy: say you have a variable that's unicode
| directory = u"c:\\temp"
| It's unicode not because you want it to be, but because it was, for
| example, read from _winreg, which returns unicode.
| You do an os.listdir(directory). Note that all filenames returned are
| now unicode. (A change introduced, I believe, in 2.3.)


That's only true if type(directory) gives you <type 'unicode'>.
If you call str(directory) first and pass the result to os.listdir(),
you (in most cases) won't even notice and can continue doing what you want to
just fine - plus, and that's the good part - you can forget about
those hacks you suggest later, which some would consider *evil*.
It'll save you some time too.
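
A minimal sketch of the point above: the type of the names os.listdir() returns follows the type of the path you hand it. It is written for Python 3 (an assumption, not the 2.3 from this thread), where str plays the role of 2.x unicode and bytes plays the role of 2.x str. The helper listdir_types and the file name example.txt are made up for illustration.

```python
import os
import tempfile


def listdir_types(path):
    """Return the set of type names that os.listdir() yields for this path."""
    return {type(name).__name__ for name in os.listdir(path)}


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listing is not empty.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # Text path in, text names out.
    text_result = listdir_types(tmp)
    # Byte-string path in, byte-string names out.
    bytes_result = listdir_types(os.fsencode(tmp))

print(text_result, bytes_result)
```

So converting the directory once, up front, keeps everything downstream in a single string type.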

Hey, and leave my Swahili friends alone will ya! ;)

Vincent Wehren

| You add the filenames to a zipfile.ZipFile object. Sometimes, you will
| get this exception:
| Traceback (most recent call last):
|   File "collect_trace_info.py", line 65, in CollectTraceInfo
|     z.write(pathname)
|   File "C:\Python23\lib\zipfile.py", line 416, in write
|     self.fp.write(zinfo.FileHeader())
|   File "C:\Python23\lib\zipfile.py", line 170, in FileHeader
|     return header + self.filename + self.extra
| UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 12: ordinal not in range(128)
| After you have regained your composure, you find the reason: "header"
| is a byte string generated by struct.pack(). self.filename, however, is
| a unicode string because os.listdir returned it as unicode. If
| "header" contains any byte above 0x7F - which can but need not
| happen, depending on the type of file - you have an exception waiting
| for you - sometimes. Great. (The same will probably occur if
| filename contains chars > 0x7F.) The problem does not occur if you
| have "str" type filenames, because then no back-and-forth conversion is
| made.
| There is a simple fix: byte-encode the pathname before calling
| z.write(). Here is some sample code:
| import os, zipfile, win32api
| def test(directory):
|     z =
|     for filename in os.listdir(directory):
|         z.write(os.path.join(directory, filename))
|     z.close()
| if __name__ == "__main__":
|     test(unicode(win32api.GetSystemDirectory()))
| Note: It might work on your system, depending on the types of files.
| To fix it, use
| z.write(os.path.join(directory, filename).encode("latin-1"))
| But to my thinking, this is a bug in zipfile.py, really.
| Now, could anybody please just write an
| "i-don't-care-if-my-app-can-display-klingon-characters" raw byte
| encoding which doesn't throw any assertions and doesn't care whether
| or not the characters are in the 0x7F range? It's OK if I cannot port
| my batch scripts to Swahili, really.
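
For what it's worth, the failure in the traceback and the "raw byte" wish can both be sketched in a few lines. This is Python 3 (an assumption; there, bytes + str is simply a TypeError), so the implicit ASCII decode that Python 2 performs when a byte string meets a unicode string is written out by hand. The header bytes, the filename, and the helper implicit_concat are hypothetical stand-ins, not the real zipfile internals.

```python
# Hypothetical stand-in for the struct.pack() output, with 0x88 at index 12.
header = b"PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00\x88\x2f"
# Hypothetical stand-in for a unicode filename from os.listdir().
filename = "c:/temp/report.txt"


def implicit_concat(byte_part, text_part):
    """Mimic Python 2's str + unicode: decode the bytes as ASCII, then join."""
    return byte_part.decode("ascii") + text_part


try:
    implicit_concat(header, filename)
    failed = False
    bad_position = None
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start  # index of the first byte ASCII cannot decode

# Latin-1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding arbitrary bytes with it never fails and round-trips exactly.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")

# The other direction is not unconditional: text outside Latin-1 still raises.
try:
    "\u0416".encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(failed, bad_position, round_trip == all_bytes, encode_failed)
```

In other words, Latin-1 already behaves like a "don't care" codec when going from bytes to text, but .encode("latin-1") on a filename is only safe while every character in it fits in 0x00-0xFF.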
