[Web-SIG] Python 3: Form data encoding issues in cgi and urllib modules

Sun Apr 12 02:48:59 CEST 2009

Hi everyone,

I read through the recent archives and saw some discussion of similar
topics, but nothing on this exact one.  If the solution to these
issues has already been decided, please point me to the relevant
messages.  (Also, if this isn't the most appropriate list, please let
me know!)

The first issue is that there doesn't seem to be a way to parse
application/x-www-form-urlencoded query strings in a character set
other than UTF-8, for example:

'premier=un&deuxi%E8me=deux' # latin-1

The urllib.parse.unquote* functions take encoding and errors
parameters, but none of the higher-level functions do.  The solution,
it seems to me, is that the functions built on top of
them--urllib.parse.parse*, cgi.parse*, and the cgi.FieldStorage
constructor--should grow encoding and errors parameters that they pass
through to the lower-level functions.
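
As a rough sketch of the pass-through idea, a query-string parser
could simply forward encoding and errors to unquote_plus.  The helper
name parse_qsl_with_encoding is hypothetical (it is not an existing
urllib or cgi function), and the parsing is deliberately simplified:

```python
from urllib.parse import unquote_plus

def parse_qsl_with_encoding(qs, encoding='utf-8', errors='replace'):
    """Hypothetical helper: split a query string into (name, value)
    pairs, forwarding encoding/errors to the lower-level unquote call."""
    pairs = []
    for item in qs.split('&'):
        if not item:
            continue  # skip empty segments such as a trailing '&'
        name, _, value = item.partition('=')
        pairs.append((unquote_plus(name, encoding=encoding, errors=errors),
                      unquote_plus(value, encoding=encoding, errors=errors)))
    return pairs

# The latin-1 example from above: %E8 decodes to U+00E8.
result = parse_qsl_with_encoding('premier=un&deuxi%E8me=deux',
                                 encoding='latin-1')
```

With encoding='latin-1', the second field name comes back as
'deuxi\xe8me' rather than containing a replacement character.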

The second issue is that the FieldStorage classes work with text input
streams.  However, with multipart/form-data posts, posted files aren't
necessarily in the same encoding as form fields, or may be binary and
not text at all.  I would suggest that FieldStorage should be changed
to take a binary input stream. For multipart forms, it should only
attempt to decode a part with the passed-in FieldStorage encoding if
the part's content type is text/plain and the content-disposition does
not specify a filename; otherwise, field.file would be a binary file,
and field.value should be bytes or non-existent.
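
To make the proposed rule concrete, here is a minimal sketch of the
per-part decision only; it assumes the part's headers and body have
already been split out of the binary stream.  The function name
part_value and its signature are hypothetical, and
email.message.Message is used just to parse the header parameters:

```python
from email.message import Message

def part_value(disposition, content_type, body,
               encoding='utf-8', errors='replace'):
    """Hypothetical helper: return str for plain form fields and
    bytes for anything that looks like an uploaded file."""
    headers = Message()
    headers['Content-Disposition'] = disposition
    if content_type is not None:
        headers['Content-Type'] = content_type
    # With no Content-Type header, get_content_type() reports text/plain.
    is_plain_text = headers.get_content_type() == 'text/plain'
    has_filename = headers.get_filename() is not None
    if is_plain_text and not has_filename:
        return body.decode(encoding, errors)
    return body  # leave file uploads and non-text parts as bytes

field = part_value('form-data; name="utf8text"', None,
                   b'\xc2\xa1ol\xc3\xa9!')
upload = part_value('form-data; name="file1"; filename="latin1.txt"',
                    'text/plain', b'Oh l\xe0 l\xe0!')
```

Here field is decoded to a str using the form's encoding, while
upload stays as bytes even though its content type is text/plain,
because a filename is present.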

Here is an example form submission that is currently difficult to
handle with the cgi module: a form with two attached files, posted
from a page with a charset of UTF-8.  This is similar to how a real
form submission from Safari or Firefox would look:

post_input = b"""---123
Content-Disposition: form-data; name="utf8text"

\xc2\xa1ol\xc3\xa9!
---123
Content-Disposition: form-data; name="file1"; filename="latin1.txt"
Content-Type: text/plain

Oh l\xe0 l\xe0!
---123
Content-Disposition: form-data; name="file2"; filename="binary"
Content-Type: application/octet-stream

\x80\x81\x82\x83\x84\x85\x86\x87\xad\xf0
---123--
"""

environ = {'CONTENT_LENGTH': str(len(post_input)),
           'CONTENT_TYPE': 'multipart/form-data; boundary=-123',
           'REQUEST_METHOD': 'POST'}

It's possible that the email.mime and http packages also need some
changes, but I haven't looked into those as much.  Also,
cgi.parse_multipart seems to be broken currently, since it uses
http.client.parse_headers, which expects a bytes stream.

If there's agreement on these points, I think it would be important to
get these changes (or perhaps alternate fixes) into Python 3.1; I know
that some of the changes are backwards incompatible with 3.0, but I
think that the encoding issues in the current cgi module make it very
difficult to work with.  I'm willing to take responsibility for
submitting bug reports and patches, but could probably use a more
experienced mentor to let me know if I'm doing it wrong.

If you don't think that these changes are reasonable, I'm interested
to hear your alternate suggestions.  I strongly believe that the
current behavior is broken and needs to be changed for 3.1.

Thanks for your consideration,
Miles Kaufmann