[Python-Dev] Can the cgi module be made Unicode-aware?

Skip Montanaro skip@pobox.com
Thu, 11 Apr 2002 09:29:21 -0500

    >> I keep trying to handle various places in my code where I can get
    >> input in non-ASCII encodings.  Today I realized the cgi module does
    >> nothing to translate Unicode data into unicode objects.  I see in one
    >> instance that I am getting data that is clearly utf-8 encoded, but I
    >> see nothing in the CGI script's environment variables to suggest the
    >> client web browser told the server how the data was encoded other
    >> than the obvious "Content-Type: application/x-www-form-urlencoded".
    >> Is utf-8 implied for the data once the url encoding has been
    >> reversed?

    Guido> I very much doubt it.  You probably received that UTF-8 data from
    Guido> a non-standard-conforming browser.

I did some reading before nodding off last night.  The <form> tag takes an
optional "accept-charset" attribute, which can be a list.  By default, the
charset is "UNKNOWN", which is commonly taken to imply that the charset of
the returned data is the same as the charset of the HTML page containing the
form.

    Guido> I must be misunderstanding your question, because the answer I'm
    Guido> thinking of is unicode(s,'utf8') and that can't possibly be what
    Guido> you can never remember.

I eventually did figure it out. :-) What I always forget is the stinking
.encode() method to get it back to something printable.  In my little dummy
script I had

    print unicode(info, "utf-8")

instead of

    print unicode(info, "utf-8").encode("some-encoding")

It kept raising UnicodeError.  I thought it was on the conversion to
Unicode, but it was on the implicit conversion back to a printable string.
The tracebacks look similar:

    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 3, in ?
        x = unicode(info)
    UnicodeError: ASCII decoding error: ordinal not in range(128)


    Traceback (most recent call last):
      File "/home/skip/tmp/junk.py", line 4, in ?
        print x
    UnicodeError: ASCII encoding error: ordinal not in range(128)

I was just missing (or misinterpreting) the words "decoding" and "encoding".
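
A minimal sketch of the distinction between the two failures.  It is written
for Python 3, which is an assumption (the thread predates it); there the two
cases raise separate exception classes, UnicodeDecodeError and
UnicodeEncodeError.  The helper name failing_step is hypothetical.

```python
def failing_step(raw, source_charset, output_charset):
    """Report which conversion fails: 'decoding', 'encoding', or None.

    raw is the byte string as received; source_charset is the charset we
    believe it is in; output_charset is the charset we want to print in.
    """
    try:
        # bytes -> text: this is the "decoding" direction
        text = raw.decode(source_charset)
    except UnicodeDecodeError:
        return "decoding"
    try:
        # text -> bytes: this is the "encoding" direction
        text.encode(output_charset)
    except UnicodeEncodeError:
        return "encoding"
    return None


# UTF-8 bytes for "caf" followed by e-acute
info = b"caf\xc3\xa9"
print(failing_step(info, "ascii", "ascii"))    # fails while decoding
print(failing_step(info, "utf-8", "ascii"))    # fails while encoding
print(failing_step(info, "utf-8", "latin-1"))  # both steps succeed
```

The second call corresponds to the print case above: the decode with an
explicit charset works, and only the conversion back to ASCII fails.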

    Guido> (There's also an approach that tries to compare the converted to
    Guido> the unconverted version and catches the exception; if no
    Guido> exception is raised, the input string was pure ASCII and the
    Guido> Unicode conversion is unnecessary.)

Yes, I use this technique elsewhere.
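
A rough sketch of that check, again in Python 3 as an assumption.  It is a
simplified reading of the technique: the comparison step is left out because
the ASCII decode alone tells us whether conversion is needed.  The name
is_pure_ascii is hypothetical.

```python
def is_pure_ascii(raw):
    """Return True if the byte string decodes cleanly as ASCII."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        # At least one byte is outside range(128), so a real charset
        # is needed to interpret the data.
        return False
    return True


print(is_pure_ascii(b"hello"))
print(is_pure_ascii(b"caf\xc3\xa9"))
```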

Now, back to my original problem... :-)

As far as I can tell, the underlying data encoding of the form's data is
generally going to be implicit.  Adding an "accept-charset" attribute to the
<form> does appear to have some effect on Content-Type in some instances,
but not in all.  I wrote a page with Latin-1 as the charset and specified
utf-8 as the charset for the form.  Upon submission, Opera added a charset
attribute to the Content-Type header, Mozilla didn't.  If I leave off
accept-charset for the form, neither browser adds a charset attribute to the
Content-Type header.  In all cases I tried, both properly encoded the form
data though.

Can someone with access to Internet Explorer please give


a try?  Does it honor the accept-charset attribute of the form (which is
currently utf-8)?  Does it add a charset to the Content-Type header or not?

The cgi programmer can't rely on charset information coming from the browser
and will need a way to tell the cgi module what the charset of the incoming
data is.  I think FieldStorage and MiniFieldStorage need optional charset
parameters, and I think the charset from the Content-Type header should be
used, if present.  If neither is given, I think the current behavior should
be retained (no interpretation/conversion of input data).
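
The proposal can be sketched outside of cgi.py roughly as follows.  This is
not the cgi module's API: parse_form and header_charset are hypothetical
names, the code is Python 3, and the precedence (header charset first, then
the caller's charset) is an assumption, since the text above does not say
which should win.

```python
from email.message import Message
from urllib.parse import unquote_to_bytes


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def parse_form(body, content_type, charset=None):
    """Parse an urlencoded form body into a list of (name, value) pairs.

    Assumed precedence: charset from the Content-Type header, then the
    caller-supplied charset.  If neither is available, the pairs are
    returned as undecoded bytes (no interpretation of the input data).
    """
    effective = header_charset(content_type) or charset
    fields = []
    for pair in body.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        # "+" means space in form data; translate it before undoing the
        # %XX escapes so that an escaped "%2B" stays a plus sign.
        raw = [unquote_to_bytes(part.replace("+", " "))
               for part in (name, value)]
        if effective is None:
            fields.append(tuple(raw))
        else:
            fields.append(tuple(item.decode(effective) for item in raw))
    return fields


urlencoded = "application/x-www-form-urlencoded"
print(parse_form("name=Fran%C3%A7ois", urlencoded + "; charset=utf-8"))
print(parse_form("name=Fran%E7ois", urlencoded, charset="latin-1"))
print(parse_form("name=Fran%C3%A7ois", urlencoded))
```

The last call shows the fallback: with no charset from either source the
values come back as raw bytes, matching the current behavior.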

After a bit of reflection, I'm not so sure I want to mess with cgi.py.  :-)
I'll try forcing my desired charset in my forms for the time being and see
what happens.  Maybe I'll fiddle around with a FieldStorage subclass, but
that will be outside of cgi.py.  I will update FAQ 4.102, however.
