[Python-Dev] Can the cgi module be made Unicode-aware?

Martin v. Loewis martin@v.loewis.de
11 Apr 2002 18:19:47 +0200


Guido van Rossum <guido@python.org> writes:

> > "Content-Type:
> > application/x-www-form-urlencoded".  Is utf-8 implied for the data
> > once the url encoding has been reversed?
> 
> I very much doubt it.  You probably received that UTF-8 data from a
> non-standard-conforming browser.

That's partially a bug in HTTP forms, partially a bug in the browsers,
and partially a bug in many CGI scripts. The original URL encoding of
form parameters (in the URL itself, using GET) does not allow a
specification of the encoding; that's the bug in HTTP.

To work around this, *all* browsers (by silent convention) send form
parameters in the encoding that the document was in. So if the
document containing the form is in UTF-8, they will send the form
parameters in UTF-8. Of course, unless you *know* what encoding the
original document had, there is no way of telling that it is UTF-8.
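
To make the ambiguity concrete, here is a minimal sketch in present-day
Python 3 (not the cgi module discussed in this thread). It assumes the
script already knows the encoding of the page that carried the form;
decode_form and page_encoding are hypothetical names chosen for
illustration.

```python
from urllib.parse import parse_qs


def decode_form(query, page_encoding):
    """Decode a URL-encoded query string using an assumed page encoding.

    The request itself carries no charset information, so the caller
    has to supply the encoding out of band.
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return parse_qs(query, encoding=page_encoding, errors="replace")


# The same percent-encoded bytes give different text depending on the guess.
utf8_query = "name=M%C3%BCller"    # "Müller" as sent from a UTF-8 page
latin1_query = "name=M%FCller"     # "Müller" as sent from an ISO-8859-1 page

print(decode_form(utf8_query, "utf-8"))
print(decode_form(utf8_query, "iso-8859-1"))
print(decode_form(latin1_query, "iso-8859-1"))
print(decode_form(latin1_query, "utf-8"))
```

Only the first and third calls recover the intended name; the other two
decode without raising an error but yield mangled text, which is why
the script has to know the document encoding in advance.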

The RFC specifies that, if multipart/form-data is used, text fields
*should* have a Content-Type field, with a charset argument. The bug
in the browsers is that they omit the Content-Type declaration for
individual fields.
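
Per-field Content-Type headers are something a multipart/form-data body
can carry. The following sketch, again in present-day Python 3, shows
how a script could honor a per-field charset if a browser sent one, and
fall back to an assumed page encoding otherwise. It uses the standard
email parser on a hand-built body; decode_fields, fallback_encoding,
the boundary and the field names are all hypothetical. In a real CGI
request the top-level Content-Type arrives separately from the body, so
it is prepended here only to keep the example self-contained.

```python
from email import message_from_bytes


def decode_fields(raw_body, fallback_encoding):
    """Decode text fields of a multipart/form-data message.

    Uses each part's own charset parameter when present, otherwise
    the caller-supplied fallback (for example, the page encoding).
    """
    msg = message_from_bytes(raw_body)
    fields = {}
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        # get_content_charset() returns None when no charset was declared.
        charset = part.get_content_charset() or fallback_encoding
        fields[name] = part.get_payload(decode=True).decode(charset)
    return fields


raw = (
    b"Content-Type: multipart/form-data; boundary=XYZ\r\n"
    b"\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="city"\r\n'
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"M\xc3\xbcnchen\r\n"
    b"--XYZ\r\n"
    b'Content-Disposition: form-data; name="street"\r\n'
    b"\r\n"
    b"Stra\xdfe\r\n"
    b"--XYZ--\r\n"
)

print(decode_fields(raw, "iso-8859-1"))
```

The "city" field declares its own charset and decodes correctly no
matter what the fallback is; the "street" field omits the declaration,
so its decoding is only as good as the fallback guess.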

I've reported this bug for MSIE, Mozilla, and Opera. Some Mozilla
author told me that they tried sending a charset= parameter, and that
many Web sites broke when this was done - this is the bug in many CGI
scripts.

Regards,
Martin