[Python-Dev] Can the cgi module be made Unicode-aware?
Martin v. Loewis
11 Apr 2002 18:19:47 +0200
Guido van Rossum <firstname.lastname@example.org> writes:
> > "Content-Type:
> > application/x-www-form-urlencoded". Is utf-8 implied for the data
> > once the url encoding has been reversed?
> I very much doubt it. You probably received that UTF-8 data from a
> non-standard-conforming browser.
That's partially a bug in HTTP forms, partially a bug in the browsers,
and partially a bug in many CGI scripts. The original URL encoding of
form parameters (in the URL itself, using GET) does not allow a
specification of the encoding; that's the bug in HTTP.
To work around this, *all* browsers (by silent convention) send form
parameters in the encoding that the document was in. So if the
document containing the form is in UTF-8, they will send the form
parameters in UTF-8. Of course, unless you *know* what encoding the
original document had, there is no way of telling that it is UTF-8.
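To illustrate the ambiguity, here is a minimal sketch using urllib.parse
from present-day Python 3 (not the cgi module as it existed when this was
written). The query string and the field name "name" are made up for the
example. The same percent-encoded bytes yield different text depending on
which encoding the receiving script assumes:

```python
from urllib.parse import parse_qs

# Hypothetical query string: the word "café" as a browser would send it
# from a form embedded in a UTF-8 document.
query = "name=caf%C3%A9"

# The script guesses correctly that the form page was UTF-8.
as_utf8 = parse_qs(query, encoding="utf-8")["name"][0]

# The script guesses Latin-1 instead; nothing in the request says this
# is wrong, so the bytes are silently misread.
as_latin1 = parse_qs(query, encoding="latin-1")["name"][0]

print(repr(as_utf8))
print(repr(as_latin1))
```

Both calls succeed without an error, which is why a script has to know the
encoding of the originating document rather than detect it.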
The RFC specifies that, if multipart/form-data is used,
text fields *should* have a Content-Type field, with a charset
argument. The bug in the browsers is that they omit the Content-Type
declaration for individual fields.
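As a rough sketch of what a script could do if browsers did send that
declaration: per-field headers only exist in a multipart/form-data body, so
the example assumes that encoding. It uses the email package from
present-day Python 3; the boundary "XyZ", the field name "name", and the
UTF-8 fallback are assumptions for illustration, not browser behaviour.

```python
from email.parser import BytesParser
from email.policy import default

# Hypothetical request in which the field *does* carry its own
# Content-Type with a charset parameter.
content_type = "multipart/form-data; boundary=XyZ"
body = (
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="name"\r\n'
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
    b"--XyZ--\r\n"
)

# Prepend the request's Content-Type header so the email parser sees a
# complete MIME message.
raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
message = BytesParser(policy=default).parsebytes(raw)

fields = {}
for part in message.iter_parts():
    field_name = part.get_param("name", header="content-disposition")
    charset = part.get_content_charset()  # None when no charset is declared
    payload = part.get_payload(decode=True)  # raw bytes of the field value
    # Assumed fallback: treat undeclared fields as UTF-8.
    fields[field_name] = payload.decode(charset or "utf-8")

print(fields)
```

When the charset is omitted, get_content_charset() returns None and the
script is back to guessing, which is the situation described above.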
I've reported this bug for MSIE, Mozilla, and Opera. Some Mozilla
author told me that they tried sending a charset= parameter, and that
many Web sites broke when this was done - this is the bug in many CGI