cgi.FieldStorage with multipart/form-data tries to decode binary file as UTF-8 if "filename=" not specified
I posted this on StackOverflow [1], but I'm posting it here as well, as I believe this is a bug (or at least quirk) in cgi.FieldStorage where you can't access a file upload properly if "filename=" is not present in the MIME part's Content-Disposition header. There are a couple of related bugs open (and closed) on bugs.python.ord, but not quite this issue. Is it legitimate for cgi.FieldStorage to use the presence of "filename=" to determine "this is a binary file" (in which case this is not a bug and my client is just buggy), or is this a bug? I lean towards the latter as the spec indicates that the filename is optional [2]. Copying from my StackOverflow question, including a test/repro case: When I use `cgi.FieldStorage` to parse a `multipart/form-data` request (or any web framework like Pyramid which uses `cgi.FieldStorage`) I have trouble processing file uploads from certain clients which don't provide a `filename=file.ext` in the part's `Content-Disposition` header. If the `filename=` option is missing, `FieldStorage()` tries to decode the contents of the file as UTF-8 and return a string. And obviously many files are binary and not UTF-8 and as such give bogus results. For example: >>> import cgi >>> import io >>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' + ... b'Content-Disposition: form-data; name=payload\r\n\r\n' + ... b'\xff\xd8\xff\xe0\x00\x10JFIF') >>> env = { ... 'REQUEST_METHOD': 'POST', ... 'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH-itP09uVKjjZiegh7', ... 'CONTENT_LENGTH': len(body), ... } >>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) (None, '����\x00\x10JFIF') Browsers, and *most* HTTP libraries do include the `filename=` option for file uploads, but I'm currently dealing with a client that doesn't (and omitting the `filename` does seem to be valid according to the spec). Currently I'm using a pretty hacky workaround by subclassing `FieldStorage` and replacing the relevant `Content-Disposition` header with one that does have the filename: import cgi import os class FileFieldStorage(cgi.FieldStorage): """To use, subclass FileFieldStorage and override _file_fields with a tuple of the names of the file field(s). You can also override _file_name with the filename to add. """ _file_fields = () _file_name = 'file_name' def __init__(self, fp=None, headers=None, outerboundary=b'', environ=os.environ, keep_blank_values=0, strict_parsing=0, limit=None, encoding='utf-8', errors='replace'): if self._file_fields and headers and headers.get('content-disposition'): content_disposition = headers['content-disposition'] key, pdict = cgi.parse_header(content_disposition) if (key == 'form-data' and pdict.get('name') in self._file_fields and 'filename' not in pdict): del headers['content-disposition'] quoted_file_name = self._file_name.replace('"', '\\"') headers['content-disposition'] = '{}; filename="{}"'.format( content_disposition, quoted_file_name) super().__init__(fp=fp, headers=headers, outerboundary=outerboundary, environ=environ, keep_blank_values=keep_blank_values, strict_parsing=strict_parsing, limit=limit, encoding=encoding, errors=errors) Using the `body` and `env` in my first test, this works now: >>> class TestFieldStorage(FileFieldStorage): ... _file_fields = ('payload',) >>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) ('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF') Is there some way to avoid this hack and tell `FieldStorage` not to decode as UTF-8? It would be nice if you could provide `encoding=None` or something, but it doesn't look like it supports that. Thanks, Ben. [1] https://stackoverflow.com/questions/42213318/cgi-fieldstorage-with-multipart... [2] https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1
On Wed, 15 Feb 2017 at 08:14 Ben Hoyt <benhoyt@gmail.com> wrote:
I posted this on StackOverflow [1], but I'm posting it here as well, as I believe this is a bug (or at least quirk) in cgi.FieldStorage where you can't access a file upload properly if "filename=" is not present in the MIME part's Content-Disposition header. There are a couple of related bugs open (and closed) on bugs.python.ord, but not quite this issue.
Is it legitimate for cgi.FieldStorage to use the presence of "filename=" to determine "this is a binary file" (in which case this is not a bug and my client is just buggy), or is this a bug? I lean towards the latter as the spec indicates that the filename is optional [2].
Assuming this isn't a recent change in semantics I would say this is now a quick considering how old the module is and people probably rely on its current semantics. -Brett
Copying from my StackOverflow question, including a test/repro case:
When I use `cgi.FieldStorage` to parse a `multipart/form-data` request (or any web framework like Pyramid which uses `cgi.FieldStorage`) I have trouble processing file uploads from certain clients which don't provide a `filename=file.ext` in the part's `Content-Disposition` header.
If the `filename=` option is missing, `FieldStorage()` tries to decode the contents of the file as UTF-8 and return a string. And obviously many files are binary and not UTF-8 and as such give bogus results.
For example:
>>> import cgi >>> import io >>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' + ... b'Content-Disposition: form-data; name=payload\r\n\r\n' + ... b'\xff\xd8\xff\xe0\x00\x10JFIF') >>> env = { ... 'REQUEST_METHOD': 'POST', ... 'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH-itP09uVKjjZiegh7', ... 'CONTENT_LENGTH': len(body), ... } >>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) (None, '����\x00\x10JFIF')
Browsers, and *most* HTTP libraries do include the `filename=` option for file uploads, but I'm currently dealing with a client that doesn't (and omitting the `filename` does seem to be valid according to the spec).
Currently I'm using a pretty hacky workaround by subclassing `FieldStorage` and replacing the relevant `Content-Disposition` header with one that does have the filename:
import cgi import os
class FileFieldStorage(cgi.FieldStorage): """To use, subclass FileFieldStorage and override _file_fields with a tuple of the names of the file field(s). You can also override _file_name with the filename to add. """
_file_fields = () _file_name = 'file_name'
def __init__(self, fp=None, headers=None, outerboundary=b'', environ=os.environ, keep_blank_values=0, strict_parsing=0, limit=None, encoding='utf-8', errors='replace'):
if self._file_fields and headers and headers.get('content-disposition'): content_disposition = headers['content-disposition'] key, pdict = cgi.parse_header(content_disposition) if (key == 'form-data' and pdict.get('name') in self._file_fields and 'filename' not in pdict): del headers['content-disposition'] quoted_file_name = self._file_name.replace('"', '\\"') headers['content-disposition'] = '{}; filename="{}"'.format( content_disposition, quoted_file_name)
super().__init__(fp=fp, headers=headers, outerboundary=outerboundary, environ=environ, keep_blank_values=keep_blank_values, strict_parsing=strict_parsing, limit=limit, encoding=encoding, errors=errors)
Using the `body` and `env` in my first test, this works now:
>>> class TestFieldStorage(FileFieldStorage): ... _file_fields = ('payload',) >>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) ('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')
Is there some way to avoid this hack and tell `FieldStorage` not to decode as UTF-8? It would be nice if you could provide `encoding=None` or something, but it doesn't look like it supports that.
Thanks, Ben.
[1] https://stackoverflow.com/questions/42213318/cgi-fieldstorage-with-multipart... [2] https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/brett%40python.org
That's reasonable. This isn't an issue on Python 2.x because everything is handled as bytes (str on 2.x). It looks like cgi.FieldStorage() only got Unicode support in Python 3.x at all fairly "late" in the game, for 3.2 or 3.3. It was in that commit (https://github.com/python/cpy thon/commit/5c23b8e6ea7cdb0002842d16dbce4b4d716dd35a by Victor Stinner) where the check for "it's a binary file if it has the 'filename' parameter" was added. It wasn't an issue before that because Unicode wasn't really handled. It seems to me this could be added to FieldStorage pretty easily in a backwards-compatible way, by one of the following: 1) Adding a file_fields keyword argument to FieldStorage.__init__ which would allow you to specify that "these fields are definitely files, please treat them as binary". So in the case of my example on the SO question, you'd do FieldStorage(..., file_fields=('payload',)) 2) Add a .binary_file (or .file_binary) property to FieldStorage which would be like .file but for sure give you a binary file. 3) Allowing you to specify encoding=None, which would mean don't decode fields to str, return everything as bytes. This might not be so nice, as then other non-file fields would be returned as bytes too, and the caller would have to decode those manually. Any thoughts on which of #1-3 might be best? I also realized that to work around this, instead of my fairly complicated content-disposition headers fix, I can subclass FieldStorage and just override "filename" with a property, so that FieldStorage.__init__ thinks that field does have a filename: class MyFieldStorage(cgi.FieldStorage): @property def filename(self): return 'file_name' if self.name == 'payload' else self._filename @filename.setter def filename(self, value): self._filename = value Also, on StackOverflow someone responded that you can also work around this by passing errors='surrogateescape' and then file_input.read().encode('utf-8', 'surrogateescape') ... pretty hacky, but at least you can get the bytes back. -Ben On Wed, Feb 15, 2017 at 12:35 PM, Brett Cannon <brett@python.org> wrote:
On Wed, 15 Feb 2017 at 08:14 Ben Hoyt <benhoyt@gmail.com> wrote:
I posted this on StackOverflow [1], but I'm posting it here as well, as I believe this is a bug (or at least quirk) in cgi.FieldStorage where you can't access a file upload properly if "filename=" is not present in the MIME part's Content-Disposition header. There are a couple of related bugs open (and closed) on bugs.python.ord, but not quite this issue.
Is it legitimate for cgi.FieldStorage to use the presence of "filename=" to determine "this is a binary file" (in which case this is not a bug and my client is just buggy), or is this a bug? I lean towards the latter as the spec indicates that the filename is optional [2].
Assuming this isn't a recent change in semantics I would say this is now a quick considering how old the module is and people probably rely on its current semantics.
-Brett
Copying from my StackOverflow question, including a test/repro case:
When I use `cgi.FieldStorage` to parse a `multipart/form-data` request (or any web framework like Pyramid which uses `cgi.FieldStorage`) I have trouble processing file uploads from certain clients which don't provide a `filename=file.ext` in the part's `Content-Disposition` header.
If the `filename=` option is missing, `FieldStorage()` tries to decode the contents of the file as UTF-8 and return a string. And obviously many files are binary and not UTF-8 and as such give bogus results.
For example:
>>> import cgi >>> import io >>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' + ... b'Content-Disposition: form-data; name=payload\r\n\r\n' + ... b'\xff\xd8\xff\xe0\x00\x10JFIF') >>> env = { ... 'REQUEST_METHOD': 'POST', ... 'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH- itP09uVKjjZiegh7', ... 'CONTENT_LENGTH': len(body), ... } >>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) (None, '����\x00\x10JFIF')
Browsers, and *most* HTTP libraries do include the `filename=` option for file uploads, but I'm currently dealing with a client that doesn't (and omitting the `filename` does seem to be valid according to the spec).
Currently I'm using a pretty hacky workaround by subclassing `FieldStorage` and replacing the relevant `Content-Disposition` header with one that does have the filename:
import cgi import os
class FileFieldStorage(cgi.FieldStorage): """To use, subclass FileFieldStorage and override _file_fields with a tuple of the names of the file field(s). You can also override _file_name with the filename to add. """
_file_fields = () _file_name = 'file_name'
def __init__(self, fp=None, headers=None, outerboundary=b'', environ=os.environ, keep_blank_values=0, strict_parsing=0, limit=None, encoding='utf-8', errors='replace'):
if self._file_fields and headers and headers.get('content- disposition'): content_disposition = headers['content-disposition'] key, pdict = cgi.parse_header(content_disposition) if (key == 'form-data' and pdict.get('name') in self._file_fields and 'filename' not in pdict): del headers['content-disposition'] quoted_file_name = self._file_name.replace('"', '\\"') headers['content-disposition'] = '{}; filename="{}"'.format( content_disposition, quoted_file_name)
super().__init__(fp=fp, headers=headers, outerboundary=outerboundary, environ=environ, keep_blank_values=keep_blank_values, strict_parsing=strict_parsing, limit=limit, encoding=encoding, errors=errors)
Using the `body` and `env` in my first test, this works now:
>>> class TestFieldStorage(FileFieldStorage): ... _file_fields = ('payload',) >>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env) >>> (fs['payload'].filename, fs['payload'].file.read()) ('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')
Is there some way to avoid this hack and tell `FieldStorage` not to decode as UTF-8? It would be nice if you could provide `encoding=None` or something, but it doesn't look like it supports that.
Thanks, Ben.
[1] https://stackoverflow.com/questions/42213318/cgi- fieldstorage-with-multipart-form-data-tries-to-decode- binary-file-as-utf-8-e [2] https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ brett%40python.org
participants (2)
-
Ben Hoyt
-
Brett Cannon