Unicode in cgi-script with apache2
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 17 09:00:53 EDT 2014
Dominique Ramaekers wrote:
> As I suspected, if I check the used encoding in wsgi I get:
> ANSI_X3.4-1968
That's another name for ASCII.
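You can confirm the alias from Python itself. A quick sketch using only the
standard codecs module (the variable names are mine, just for illustration):

```python
import codecs

# Look up the odd-looking name reported by the server process.
info = codecs.lookup('ANSI_X3.4-1968')

# CodecInfo.name holds Python's canonical name for the codec.
canonical_name = info.name
print(canonical_name)  # prints: ascii
```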
> I found you can define the coding of the script with a special comment:
> # -*- coding: utf-8 -*-
Be careful. That just tells Python what encoding the source code file is in.
It is not used by print(), or reading/writing files, just when the compiler
reads the source code.
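To see that the declaration only affects how the compiler decodes the source,
here is a rough sketch. The helper literal_from_source is hypothetical, made up
for this demonstration: it compiles the same raw bytes under two different
declared encodings and returns the resulting string literal.

```python
def literal_from_source(declared_encoding):
    """Compile identical source bytes under the given coding declaration."""
    source = (
        b"# -*- coding: " + declared_encoding.encode("ascii") + b" -*-\n"
        b"text = '\xc3\xa9'\n"   # the two bytes that UTF-8 uses for e-acute
    )
    namespace = {}
    # compile() accepts bytes and honours the coding comment on line 1.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["text"]

as_utf8 = literal_from_source("utf-8")      # one character
as_latin1 = literal_from_source("latin-1")  # two characters
print(len(as_utf8), len(as_latin1))
```

Nothing here touches print() or file I/O: the declaration changes what string
the literal becomes, not how that string is later encoded on output.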
> Now I don't get an error but my special chars still doesn't display well.
> The script:
> # -*- coding: utf-8 -*-
> import sys
> def application(environ, start_response):
>     status = '200 OK'
>     output = 'Hello World! é ü à ũ'
>     #output = sys.getfilesystemencoding() #1
>
>     response_headers = [('Content-type', 'text/plain'),
>                         ('Content-Length', str(len(output)))]
>     start_response(status, response_headers)
>
>     return [output]
>
> Gives in the browser as output:
>
> Hello World! Ã© Ã¼ Ã  Å©
That looks like ordinary mojibake. Your Python script takes the text
string 'Hello World! é ü à ũ', which in UTF-8 gives you bytes:
py> 'Hello World! é ü à ũ'.encode('utf-8')
b'Hello World! \xc3\xa9 \xc3\xbc \xc3\xa0 \xc5\xa9'
Decoding those bytes as Latin-1 instead of UTF-8 gives:
py> 'Hello World! é ü à ũ'.encode('utf-8').decode('latin1')
'Hello World! Ã© Ã¼ Ã\xa0 Å©'
which appears to be exactly what you have. Why Latin-1 instead of ASCII?
Because the process has to output *something*, and Latin-1 is sometimes
called "extended ASCII".
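If that diagnosis is right, the damage is reversible: encoding the garbled text
as Latin-1 recovers the original UTF-8 bytes. A small sketch (the function names
make_mojibake and repair_mojibake are my own, not from any library):

```python
def make_mojibake(text):
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')

def repair_mojibake(garbled):
    """Undo the mistake: recover the bytes via Latin-1, decode as UTF-8."""
    return garbled.encode('latin-1').decode('utf-8')

original = 'Hello World! é ü à ũ'
garbled = make_mojibake(original)
restored = repair_mojibake(garbled)
print(restored == original)  # prints: True
```

This works because Latin-1 maps every byte 0-255 to exactly one character, so
no information is lost in the wrong decoding step.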
I'm starting to fear a bug in Python 3.4, but since I have almost no
knowledge about wsgi and cgi, I can't be sure that this isn't just normal
expected behaviour :-(
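Before blaming Python, one thing that may be worth ruling out is whether
anything tells the browser which encoding the response uses. As far as I
understand PEP 3333, under Python 3 the response body should be bytes, so a
variant of your script that encodes explicitly and declares the charset would
look something like this. It is a sketch only: I have not run it under Apache,
and the charset parameter is my addition, not something from your script.

```python
def application(environ, start_response):
    status = '200 OK'
    text = 'Hello World! é ü à ũ'

    # Encode explicitly rather than relying on the process's default encoding.
    body = text.encode('utf-8')

    response_headers = [
        # Assumption: declaring the charset lets the browser decode as UTF-8.
        ('Content-Type', 'text/plain; charset=utf-8'),
        # Content-Length counts bytes, not characters.
        ('Content-Length', str(len(body))),
    ]
    start_response(status, response_headers)

    return [body]
```

Note that the length is taken from the encoded bytes; len() of the text string
would give a smaller number here, because each accented character takes two
bytes in UTF-8.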
--
Steven