[docs] [issue20907] behavioral differences between shutil.unpack_archive and ZipFile.extractall
report at bugs.python.org
Thu Mar 20 10:56:39 CET 2014
Peter Santoro added the comment:
It seems clear to me that the logic in shutil._unpack_zipfile that silently skips paths that start with '/' (indicates absolute path) or that contain references to the parent directory ('..') was added to prevent malicious zip files from making potential malicious/unwanted modifications to the filesystem (perhaps at a time when zipfile did not itself contain such logic). This conservative approach works, but it can have unexpected results. For example, if all entries in a zip file contain these invalid characters, then shutil._unpack_zipfile appears to do nothing (i.e. the zip file is not unpacked). This is good (except for the silent part), if the zip file is truly malicious. However, I recently had to deal with thousands of zip files created by well known software vendors where hundreds of the zip files were created incorrectly and contained these invalid characters. These files were not malicious, but they were created improperly. Note that shutil._unpack_zipfile silently failed to unzip these files, but by using ZipFile.extractall I could unzip them.
It appears that most unzipping software today either either ignores (sometimes silently) potentially malicious zip entries (e.g. Windows 7 Explorer displays an invalid zip file error) or it attempts to filter out/replace known bad characters so that the zip entries can be extracted (e.g. WinZip, gnu unzip). I created this issue because the Python library uses both approaches, which may need rethinking.
The newer logic in ZipFile._extract_member, which is used by ZipFile.extractall, takes a different approach. Instead of silently ignoring potentially malicious zip entries, it attempts to filter out or replace known invalid characters before extracting the zip entries.
>From the Python zipfile docs:
If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).
As ZipFile._extract_member filters out/replaces more invalid characters than shutil._unpack_zipfile handles, one could argue that the (apparent older) approach used by shutil._unpack_zipfile is less safe.
The approaches used by shutil._unpack_zipfile and ZipFile.extractall to deal with potentially malicious zip file entries are different. This issue could be closed if not deemed important by the Python core developers or it could be handled by documentation and/or coding changes.
Python tracker <report at bugs.python.org>
More information about the docs