[Python-Dev] cgi.FieldStorage with multipart/form-data tries to decode binary file as UTF-8 if "filename=" not specified

Wed Feb 15 14:41:09 EST 2017

That's reasonable. This isn't an issue on Python 2.x because everything is
handled as bytes (str on 2.x). It looks like cgi.FieldStorage() only got
Unicode support in Python 3.x at all fairly "late" in the game, for 3.2 or
3.3. It was in that commit (https://github.com/python/cpy
thon/commit/5c23b8e6ea7cdb0002842d16dbce4b4d716dd35a by Victor Stinner)
where the check for "it's a binary file if it has the 'filename' parameter"
was added. It wasn't an issue before that because Unicode wasn't really
handled.

It seems to me this could be added to FieldStorage pretty easily in a
backwards-compatible way, by one of the following:

1) Adding a file_fields keyword argument to FieldStorage.__init__ which
would allow you to specify that "these fields are definitely files, please
treat them as binary". So in the case of my example on the SO question,
you'd do FieldStorage(..., file_fields=('payload',))

2) Add a .binary_file (or .file_binary) property to FieldStorage which
would be like .file but for sure give you a binary file.

3) Allowing you to specify encoding=None, which would mean don't decode
fields to str, return everything as bytes. This might not be so nice, as
then other non-file fields would be returned as bytes too, and the caller
would have to decode those manually.

Any thoughts on which of #1-3 might be best?

I also realized that to work around this, instead of my fairly complicated
content-disposition headers fix, I can subclass FieldStorage and just
override "filename" with a property, so that FieldStorage.__init__ thinks
that field does have a filename:

    class MyFieldStorage(cgi.FieldStorage):
        @property
        def filename(self):
            return 'file_name' if self.name == 'payload' else self._filename

        @filename.setter
        def filename(self, value):
            self._filename = value

Also, on StackOverflow someone responded that you can also work around this
by passing errors='surrogateescape' and then
file_input.read().encode('utf-8', 'surrogateescape') ... pretty hacky, but
at least you can get the bytes back.

-Ben

On Wed, Feb 15, 2017 at 12:35 PM, Brett Cannon <brett at python.org> wrote:

>
>
> On Wed, 15 Feb 2017 at 08:14 Ben Hoyt <benhoyt at gmail.com> wrote:
>
>> I posted this on StackOverflow [1], but I'm posting it here as well, as I
>> believe this is a bug (or at least quirk) in cgi.FieldStorage where you
>> can't access a file upload properly if "filename=" is not present in the
>> MIME part's Content-Disposition header. There are a couple of related bugs
>> open (and closed) on bugs.python.ord, but not quite this issue.
>>
>> Is it legitimate for cgi.FieldStorage to use the presence of "filename="
>> to determine "this is a binary file" (in which case this is not a bug and
>> my client is just buggy), or is this a bug? I lean towards the latter as
>> the spec indicates that the filename is optional [2].
>>
>
> Assuming this isn't a recent change in semantics I would say this is now a
> quick considering how old the module is and people probably rely on its
> current semantics.
>
> -Brett
>
>
>>
>> Copying from my StackOverflow question, including a test/repro case:
>>
>> When I use `cgi.FieldStorage` to parse a `multipart/form-data` request
>> (or any web framework like Pyramid which uses `cgi.FieldStorage`) I have
>> trouble processing file uploads from certain clients which don't provide a
>> `filename=file.ext` in the part's `Content-Disposition` header.
>>
>> If the `filename=` option is missing, `FieldStorage()` tries to decode
>> the contents of the file as UTF-8 and return a string. And obviously many
>> files are binary and not UTF-8 and as such give bogus results.
>>
>> For example:
>>
>>     >>> import cgi
>>     >>> import io
>>     >>> body = (b'--KQNTvuH-itP09uVKjjZiegh7\r\n' +
>>     ...         b'Content-Disposition: form-data; name=payload\r\n\r\n' +
>>     ...         b'\xff\xd8\xff\xe0\x00\x10JFIF')
>>     >>> env = {
>>     ...     'REQUEST_METHOD': 'POST',
>>     ...     'CONTENT_TYPE': 'multipart/form-data; boundary=KQNTvuH-
>> itP09uVKjjZiegh7',
>>     ...     'CONTENT_LENGTH': len(body),
>>     ... }
>>     >>> fs = cgi.FieldStorage(fp=io.BytesIO(body), environ=env)
>>     >>> (fs['payload'].filename, fs['payload'].file.read())
>>     (None, '����\x00\x10JFIF')
>>
>> Browsers, and *most* HTTP libraries do include the `filename=` option for
>> file uploads, but I'm currently dealing with a client that doesn't (and
>> omitting the `filename` does seem to be valid according to the spec).
>>
>> Currently I'm using a pretty hacky workaround by subclassing
>> `FieldStorage` and replacing the relevant `Content-Disposition` header with
>> one that does have the filename:
>>
>>     import cgi
>>     import os
>>
>>     class FileFieldStorage(cgi.FieldStorage):
>>         """To use, subclass FileFieldStorage and override _file_fields
>> with a tuple
>>         of the names of the file field(s). You can also override
>> _file_name with
>>         the filename to add.
>>         """
>>
>>         _file_fields = ()
>>         _file_name = 'file_name'
>>
>>         def __init__(self, fp=None, headers=None, outerboundary=b'',
>>                      environ=os.environ, keep_blank_values=0,
>> strict_parsing=0,
>>                      limit=None, encoding='utf-8', errors='replace'):
>>
>>             if self._file_fields and headers and headers.get('content-
>> disposition'):
>>                 content_disposition = headers['content-disposition']
>>                 key, pdict = cgi.parse_header(content_disposition)
>>                 if (key == 'form-data' and pdict.get('name') in
>> self._file_fields and
>>                         'filename' not in pdict):
>>                     del headers['content-disposition']
>>                     quoted_file_name = self._file_name.replace('"', '\\"')
>>                     headers['content-disposition'] = '{};
>> filename="{}"'.format(
>>                             content_disposition, quoted_file_name)
>>
>>             super().__init__(fp=fp, headers=headers,
>> outerboundary=outerboundary,
>>                              environ=environ,
>> keep_blank_values=keep_blank_values,
>>                              strict_parsing=strict_parsing, limit=limit,
>>                              encoding=encoding, errors=errors)
>>
>> Using the `body` and `env` in my first test, this works now:
>>
>>     >>> class TestFieldStorage(FileFieldStorage):
>>     ...     _file_fields = ('payload',)
>>     >>> fs = TestFieldStorage(fp=io.BytesIO(body), environ=env)
>>     >>> (fs['payload'].filename, fs['payload'].file.read())
>>     ('file_name', b'\xff\xd8\xff\xe0\x00\x10JFIF')
>>
>> Is there some way to avoid this hack and tell `FieldStorage` not to
>> decode as UTF-8? It would be nice if you could provide `encoding=None` or
>> something, but it doesn't look like it supports that.
>>
>> Thanks,
>> Ben.
>>
>>
>> [1] https://stackoverflow.com/questions/42213318/cgi-
>> fieldstorage-with-multipart-form-data-tries-to-decode-
>> binary-file-as-utf-8-e
>> [2] https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1
>>
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev at python.org
>> https://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe: https://mail.python.org/mailman/options/python-dev/
>> brett%40python.org
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20170215/23f98a7f/attachment.html>