zipfile and unicode filenames
Hi everyone,

Today I've stumbled upon a bug in my program that wasn't very straightforward to understand. The problem is that I was passing unicode filenames to zipfile.ZipFile.write and I had sys.setdefaultencoding() in effect, which resulted in a situation where most of the bytes generated in zipfile.ZipInfo.FileHeader would pass thru, except for a few, which caused a codec error on another machine (where filenames got infectiously upgraded to unicode). The problem here is that it was absolutely unclear at first that I get unicode filenames passed to write, and it incorrectly accepted them silently.

Is it worth submitting a bug report on this? The desired behavior here would be to either a) disallow unicode strings as arcname and raise an exception (since arcname is used in concatenation with raw data, it is likely to cause problems because of auto-upgrading raw data to unicode), or b) silently encode unicode strings to raw strings (something like if isinstance(filename, unicode): filename = filename.encode() in the zipfile.ZipInfo constructor).

So, should I submit a bug report, and which behavior would be actually correct?
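For illustration only, the two proposed behaviors could be sketched roughly as below. This is not zipfile code. It is written for Python 3, where str plays the role of Python 2's unicode and bytes the role of the raw string. The helper name normalize_arcname, its strict parameter, and the explicit UTF-8 default are all assumptions made for the example (the post itself suggests a bare filename.encode()).

```python
def normalize_arcname(name, encoding="utf-8", strict=False):
    """Hypothetical helper: return the archive name as raw bytes.

    strict=True  -> behavior (a): refuse text names with an exception.
    strict=False -> behavior (b): encode text names explicitly.
    """
    if isinstance(name, bytes):
        # Already raw data, safe to concatenate with header bytes.
        return name
    if strict:
        raise TypeError("arcname must be a byte string, got text")
    # Explicit encoding, so nothing depends on a process-wide default.
    return name.encode(encoding)


# Stand-in for the raw header data the name is concatenated with.
header = b"PK\x03\x04"
record = header + normalize_arcname("r\u00e9sum\u00e9.txt")
```

Either way the name is raw bytes before it meets the header bytes, so no implicit conversion can happen during concatenation.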
Today I've stumbled upon a bug in my program that wasn't very straightforward to understand.
Unfortunately, it isn't straight-forward to understand your description of it, either.
The problem is that I was passing unicode filenames to zipfile.ZipFile.write and I had sys.setdefaultencoding() in effect
What do you mean here? How can sys.setdefaultencoding() be "in effect"? There is always a default encoding; did you mean you changed the default?
which resulted in a situation where most of the bytes generated in zipfile.ZipInfo.FileHeader would pass thru, except for a few, which caused a codec error on another machine (where filenames got infectiously upgraded to unicode).
Was the problem that most of the bytes would pass thru, or was the problem that a few did not pass thru? Why did filenames in the FileHeader get infectiously upgraded to unicode on the other machine, but not on the first machine?
The problem here is that it was absolutely unclear at first that I get unicode filenames passed to write, and it incorrectly accepted them silently. Is it worth submitting a bug report on this?
Let me try to rephrase what I understood so far: "I changed the default system encoding from ASCII to some other value, and that caused zipfile.py to generate an incorrect zipfile. Is that a bug in zipfile?" To that, the answer is a clear "no". If you change the default encoding, you are on your own. Don't do that.
So, should I submit a bug report, and which behavior would be actually correct?
The issue of non-ASCII file names in zipfiles is fairly well understood. The ZIP format historically did not support them well. I believe this has recently been improved, but that format change has not yet propagated into the zipfile module. However, everybody is aware of the situation, so there is no need to report a bug.

Regards,
Martin
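For readers on a current Python: the zipfile module in Python 3 does handle this case. A name that cannot be encoded as ASCII is stored as UTF-8, and general-purpose flag bit 11 (0x800) is set in the entry's header. A minimal sketch to check this, assuming Python 3; the function name utf8_flag_set, the constant name UTF8_FLAG, and the sample file names are made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: file name is stored as UTF-8


def utf8_flag_set(arcname):
    """Write one member named `arcname` to an in-memory archive and
    report whether the UTF-8 flag is set when the archive is read back."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(arcname, b"data")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        # The text name should survive the round trip unchanged.
        assert info.filename == arcname
        return bool(info.flag_bits & UTF8_FLAG)


print(utf8_flag_set("plain.txt"))
print(utf8_flag_set("\u0444\u0430\u0439\u043b.txt"))
```

This says nothing about the Python 2 module discussed in the thread, which is where the implicit str/unicode mixing occurred.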
participants (2)
- "Martin v. Löwis"
- Alexey Borzenkov