[Python-Dev] Can the cgi module be made Unicode-aware?

Thu, 11 Apr 2002 09:29:21 -0500

    >> I keep trying to handle various places in my code where I can get
    >> input in non-ASCII encodings.  Today I realized the cgi module does
    >> nothing to translate Unicode data into unicode objects.  I see in one
    >> instance that I am getting data that is clearly utf-8 encoded, but I
    >> see nothing in the CGI script's environment variables to suggest the
    >> client web browser told the server how the data was encoded other
    >> than the obvious "Content-Type: application/x-www-form-urlencoded".
    >> Is utf-8 implied for the data once the url encoding has been
    >> reversed?

    Guido> I very much doubt it.  You probably received that UTF-8 data from
    Guido> a non-standard-conforming browser.

I did some reading before nodding off last night.  The <form> tag takes an
optional "accept-charset" attribute, which can be a list.  By default, the
charset is "UNKNOWN", which is commonly taken to mean that the charset of
the returned data is the same as the charset of the HTML page containing the
form.

    Guido> I must be misunderstanding your question, because the answer I'm
    Guido> thinking of is unicode(s,'utf8') and that can't possibly be what
    Guido> you can never remember.

I eventually did figure it out. :-) What I always forget is the stinking
.encode() method to get it back to something printable.  In my little dummy
script I had

    print unicode(info, "utf-8")

instead of

    print unicode(info, "utf-8").encode("some-encoding")

It kept raising UnicodeError.  I thought it was on the conversion to
Unicode, but it was on the implicit conversion back to a printable string.
The tracebacks look similar:

    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 3, in ?
        x = unicode(info)
    UnicodeError: ASCII decoding error: ordinal not in range(128)

vs.

    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 4, in ?
        print x
    UnicodeError: ASCII encoding error: ordinal not in range(128)

I was just missing (or misinterpreting) the words "decoding" and "encoding".
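
For anyone else who trips over this, here is a small sketch of the two
failure directions.  It uses Python 3 spelling, where the two errors are
separate exception classes and bytes.decode() stands in for unicode(); the
sample bytes and the classify() helper are made up for illustration.

```python
# Hypothetical sample: the UTF-8 bytes for "café", as a CGI script might see them.
info = b"caf\xc3\xa9"

def classify(action):
    """Run action() and report which direction of conversion failed, if any."""
    try:
        action()
    except UnicodeDecodeError:
        return "decoding error"   # bytes -> text went wrong
    except UnicodeEncodeError:
        return "encoding error"   # text -> bytes went wrong
    return "ok"

# Decoding with an explicit charset works.
x = info.decode("utf-8")

# Decoding with the ASCII default fails (the first traceback above).
decode_result = classify(lambda: info.decode("ascii"))

# Encoding back to ASCII fails (the second traceback above).
encode_result = classify(lambda: x.encode("ascii"))

# Encoding with an explicit charset that can represent the text works.
explicit_result = classify(lambda: x.encode("utf-8"))
```

The only difference between the two failures is the direction of the
conversion, which is exactly what the one word in the message tells you.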

    Guido> (There's also an approach that tries to compare the converted to
    Guido> the unconverted version and catches the exception; if no
    Guido> exception is raised, the input string was pure ASCII and the
    Guido> Unicode conversion is unnecessary.)

Yes, I use this technique elsewhere.
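
Roughly, that check looks like the sketch below.  It is written in Python 3
spelling and works on bytes; the function name is my own invention, not
anything in the standard library.

```python
def is_pure_ascii(raw):
    """Return True if the byte string raw decodes cleanly as ASCII.

    If it does, no charset-aware conversion is needed for it.
    """
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

plain = is_pure_ascii(b"hello")            # nothing outside 0..127
accented = is_pure_ascii(b"caf\xc3\xa9")   # contains non-ASCII bytes
```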

Now, back to my original problem... :-)

As far as I can tell, the underlying data encoding of the form's data is
generally going to be implicit.  Adding an "accept-charset" attribute to the
<form> does appear to have some effect on Content-Type in some instances,
but not in all.  I wrote a page with Latin-1 as the charset and specified
utf-8 as the charset for the form.  Upon submission, Opera added a charset
parameter to the Content-Type header, Mozilla didn't.  If I leave off
accept-charset for the form, neither browser adds a charset parameter to the
Content-Type header.  In every case I tried, though, both browsers encoded
the form data correctly.
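
To show why the implicit encoding matters, here is a sketch (Python 3
spelling, with a made-up form value) of what happens once the url encoding
has been reversed.  The result is just bytes, and nothing in them says which
charset to use.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical submitted data: "café" sent by a browser that used UTF-8.
raw = unquote_to_bytes("name=caf%C3%A9")

# The right guess recovers the text.
as_utf8 = raw.decode("utf-8")

# The wrong guess does not raise; it silently produces two junk characters.
as_latin1 = raw.decode("latin-1")
```

Decoding as Latin-1 never fails, so a wrong guess in that direction goes
unnoticed until somebody looks at the output.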

Can someone with access to Internet Explorer please give

    http://manatee.mojam.com/~skip/sample_form.html

a try?  Does it honor the accept-charset attribute of the form (which is
currently utf-8)?  Does it add a charset to the Content-Type header or not?

The cgi programmer can't rely on charset information coming from the browser
and will need a way to tell the cgi module what the charset of the incoming
data is.  I think FieldStorage and MiniFieldStorage need optional charset
parameters, and I think the charset from the Content-Type header should be
used, if present.  If neither is given, I think the current behavior
should be retained (no interpretation/conversion of input data).
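
As a rough sketch of that fallback logic (Python 3 spelling, and not a patch
to cgi.py): both function names below are hypothetical, and letting the
explicit charset win over the header is my assumption, since I haven't
decided which should take priority.

```python
from email.message import Message

def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def decode_field(raw, charset=None, content_type=None):
    """Decode one form value (bytes) if a charset is known.

    Assumed priority: explicit charset argument first, then the charset
    from the Content-Type header.  With neither, raw is returned
    untouched, which matches the current no-conversion behavior.
    """
    chosen = charset or charset_from_content_type(content_type)
    if chosen is None:
        return raw
    return raw.decode(chosen)

header = "application/x-www-form-urlencoded; charset=utf-8"
untouched = decode_field(b"caf\xc3\xa9")
from_arg = decode_field(b"caf\xc3\xa9", charset="utf-8")
from_header = decode_field(b"caf\xc3\xa9", content_type=header)
```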

After a bit of reflection, I'm not so sure I want to mess with cgi.py.  :-)
I'll try forcing my desired charset in my forms for the time being and see
what happens.  Maybe I'll fiddle around with a FieldStorage subclass, but
that will be outside of cgi.py.  I will update FAQ 4.102, however.

Skip