[Python-Dev] Can the cgi module be made Unicode-aware?
Thu, 11 Apr 2002 09:29:21 -0500
>> I keep trying to handle various places in my code where I can get
>> input in non-ASCII encodings. Today I realized the cgi module does
>> nothing to translate Unicode data into unicode objects. I see in one
>> instance that I am getting data that is clearly utf-8 encoded, but I
>> see nothing in the CGI script's environment variables to suggest the
>> client web browser told the server how the data was encoded other
>> than the obvious "Content-Type: application/x-www-form-urlencoded".
>> Is utf-8 implied for the data once the url encoding has been
Guido> I very much doubt it. You probably received that UTF-8 data from
Guido> a non-standard-conforming browser.
I did some reading before nodding off last night. The <form> tag takes an
optional "accept-charset" attribute, which can be a list. By default, the
charset is "UNKNOWN", which is commonly taken to imply that the charset of
the returned data is the same as the charset of the HTML page containing the
form.
Guido> I must be misunderstanding your question, because the answer I'm
Guido> thinking of is unicode(s,'utf8') and that can't possibly be what
Guido> you can never remember.
I eventually did figure it out. :-) What I always forget is the stinking
.encode() method to get it back to something printable. In my little dummy
script I had

    print unicode(info, "utf-8")

instead of

    print unicode(info, "utf-8").encode("some-encoding")

It kept raising UnicodeError. I thought it was on the conversion to
Unicode, but it was on the implicit conversion back to a printable string.
The tracebacks look similar:

    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 3, in ?
        x = unicode(info)
    UnicodeError: ASCII decoding error: ordinal not in range(128)

    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 4, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)

I was just missing (or misinterpreting) the words "decoding" and "encoding".
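The decoding/encoding distinction is easy to reproduce. Below is a minimal sketch, written for Python 3 rather than the Python 2 used in this thread, so bytes.decode() stands in for unicode(s, ...) and the two failures surface as separate exception classes. The sample data and the classify() helper are made up for illustration.

```python
# Python 3 sketch; the thread itself uses Python 2's unicode() and print statement.
# `raw` is a hypothetical stand-in for the UTF-8 form data held in `info`.
raw = "caf\u00e9".encode("utf-8")


def classify(action):
    """Run action() and report which direction of conversion failed, if any."""
    try:
        action()
    except UnicodeDecodeError:
        return "decoding"  # bytes -> text failed
    except UnicodeEncodeError:
        return "encoding"  # text -> bytes failed
    return "ok"


# Treating UTF-8 bytes as ASCII fails on the way in.
decode_result = classify(lambda: raw.decode("ascii"))

# Decoding as UTF-8 works, but encoding the text back to ASCII fails on the way out.
encode_result = classify(lambda: raw.decode("utf-8").encode("ascii"))

# Naming a target encoding that can represent the text succeeds.
ok_result = classify(lambda: raw.decode("utf-8").encode("latin-1"))

print(decode_result, encode_result, ok_result)
```

The second case corresponds to the implicit conversion described above: the conversion to Unicode succeeds, and the error only appears when the text is turned back into a byte string.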
Guido> (There's also an approach that tries to compare the converted to
Guido> the unconverted version and catches the exception; if no
Guido> exception is raised, the input string was pure ASCII and the
Guido> Unicode conversion is unnecessary.)
Yes, I use this technique elsewhere.
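For reference, a minimal Python 3 sketch of that kind of check, assuming the input arrives as bytes. is_pure_ascii() is a hypothetical helper name, and this version only tests whether an ASCII decode succeeds instead of comparing the converted and unconverted values as the Python 2 idiom does.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if data decodes cleanly as ASCII, so no charset guess is needed."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        # At least one byte is >= 0x80, so a real charset decision is required.
        return False
    return True


plain = is_pure_ascii(b"hello world")
accented = is_pure_ascii("caf\u00e9".encode("utf-8"))
```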
Now, back to my original problem... :-)
As far as I can tell, the underlying data encoding of the form's data is
generally going to be implicit. Adding an "accept-charset" attribute to the
<form> does appear to have some effect on Content-Type in some instances,
but not in all. I wrote a page with Latin-1 as the charset and specified
utf-8 as the charset for the form. Upon submission, Opera added a charset
attribute to the Content-Type header, Mozilla didn't. If I leave off
accept-charset for the form, neither browser adds a charset attribute to the
Content-Type header. In all cases I tried, both properly encoded the form
data.
Can someone with access to Internet Explorer please give the page
a try? Does IE honor the accept-charset attribute of the form (which is
currently utf-8)? Does it add a charset to the Content-Type header or not?
The cgi programmer can't rely on charset information coming from the browser
and will need a way to tell the cgi module what the charset of the incoming
data is. I think FieldStorage and MiniFieldStorage need optional charset
parameters and I think the charset needs to be used from the Content-Type
header, if present. If neither is given, I think the current behavior
should be retained (no interpretation/conversion of input data).
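A rough sketch of what that behavior could look like, kept outside cgi.py and written for Python 3. Nothing here is an existing cgi API: parse_form() and charset_from_content_type() are hypothetical names, and the precedence (header charset first, then the caller's charset, otherwise raw bytes) is one reading of the proposal above, not a settled design.

```python
from email.message import Message
from urllib.parse import unquote_to_bytes


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased charset name, or None


def parse_form(body, content_type="", charset=None):
    """Parse application/x-www-form-urlencoded bytes into {name: [values]}.

    Assumed precedence: charset from the Content-Type header, then the
    caller-supplied charset. With neither, names and values stay as
    undecoded bytes (no interpretation of the input data).
    """
    effective = charset_from_content_type(content_type) or charset
    fields = {}
    for pair in body.split(b"&"):
        if not pair:
            continue
        name, _, value = pair.partition(b"=")
        # Undo the URL encoding first; this step is charset-neutral.
        name = unquote_to_bytes(name.replace(b"+", b" "))
        value = unquote_to_bytes(value.replace(b"+", b" "))
        if effective is not None:
            name = name.decode(effective)
            value = value.decode(effective)
        fields.setdefault(name, []).append(value)
    return fields


body = b"name=caf%C3%A9&note=a+b"
raw_fields = parse_form(body)
header_fields = parse_form(
    body, "application/x-www-form-urlencoded; charset=UTF-8")
forced_fields = parse_form(b"name=caf%E9", charset="latin-1")
```

Since the browser often sends no charset at all, the explicit charset argument is the path most scripts would end up relying on.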
After a bit of reflection, I'm not so sure I want to mess with cgi.py. :-)
I'll try forcing my desired charset in my forms for the time being and see
what happens. Maybe I'll fiddle around with a FieldStorage subclass, but
that will be outside of cgi.py. I will update FAQ 4.102, however.