Unicode in cgi-script with apache2
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sun Aug 17 03:50:48 EDT 2014
Denis McMahon wrote:
> From your other message, the error appears to be a python error on
> reading the input file. For some reason python seems to be trying to
> interpret the file it is reading as ascii.
Oh!!! /facepalm
I think you've got it. I've been assuming the problem was on *writing* the
line. That's because the OP was insistent that the failure was in the output
[quoting Dominique]:
    The problem is, when python 'prints' to the apache interface, it
    translates the string to ascii.
but if you read the traceback, you're right, the problem is *reading* the
file, not printing:
[Sat Aug 16 23:12:42.158326 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: Traceback (most recent call last):
[Sat Aug 16 23:12:42.158451 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: File "/var/www/cgi-python/index.html",
line 12, in <module>
[Sat Aug 16 23:12:42.158473 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: for line in f:
That's the line which is failing: iterating over the file reads it, and
whatever is read is then *decoded*. Files contain bytes, which have to be
decoded into text, and the decode is assuming ASCII:
[Sat Aug 16 23:12:42.158526 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: File
"/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
[Sat Aug 16 23:12:42.158569 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: return codecs.ascii_decode(input,
self.errors)[0]
[Sat Aug 16 23:12:42.158663 2014] [cgi:error] [pid 29327] [client
119.63.193.196:11110] AH01215: UnicodeDecodeError: 'ascii' codec can't
decode byte 0xc3 in position 1791: ordinal not in range(128)
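For anyone who wants to see the same failure outside of Apache, here is a
minimal sketch. The sample text and the temporary file are made up for
illustration, and the ASCII codec is forced explicitly, whereas the OP's
script was picking it up implicitly from somewhere:

```python
import os
import tempfile

# Hypothetical sample text; "ë" encodes to the two bytes 0xc3 0xab in UTF-8.
sample = "Hello ë world\n"

# Write the sample to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

error = None
try:
    # Force ASCII decoding to mimic what the script was doing implicitly.
    with open(path, "r", encoding="ascii") as f:
        for line in f:
            pass
except UnicodeDecodeError as exc:
    error = exc
finally:
    os.remove(path)

print(error)
```

The exception is raised by the "for line in f" loop, not by any print call,
which matches the traceback above.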
> I wonder if specifying the binary data parameter and / or utf-8 encoding
> when opening the file might help.
We don't really know what encoding the index.html file is encoded in. It
might be Latin-1, or cp-1252, or some other legacy encoding. But let's
assume it's UTF-8.
So why is Dominique's script reading it as ASCII? That's the key question. I
have a sinking feeling that Apache is running Python as a subprocess with
the C locale. I don't know enough about cgi to be more than just guessing.
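One way to test that guess is to have the script report which encodings
Python has picked up from its environment. This is only a diagnostic sketch,
and report_encodings is a name I made up; under the C locale I would expect
the preferred encoding to come back as some spelling of ASCII:

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python has inferred from its environment."""
    return {
        # The default used by open() in text mode when no encoding is given.
        "preferred": locale.getpreferredencoding(False),
        # What print() will encode to; may be None if stdout was replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Used for file names, not file contents.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print(name, "=", value)
```

Run from the cgi script and from an ordinary shell, the two reports can be
compared to see whether Apache's environment is the difference.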
Dominique, if you write:
f = open("/var/www/cgi-data/index.html", "r", encoding='utf-8')
the problem should go away (assuming index.html is valid UTF-8). If it
doesn't, there's a very strange bug somewhere.
Please try that, and see if it fixes the problem, or if the error goes to a
different line.
> eg:
>
> f = open( "/var/www/cgi-data/index.html", "rb" )
No, you don't want that, since then reading the file will return bytes, not
text. Although I suppose the OP might just commit to using bytes
everywhere. Yuck.
> f = open( "/var/www/cgi-data/index.html", "rb", encoding="utf-8" )
That makes no sense. If you're reading in binary mode, there's no encoding.
Every byte represents itself.
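A small sketch of both points, using a throwaway temporary file rather than
the OP's real one: open() refuses an encoding in binary mode, and plain
binary mode hands back bytes that you would have to decode yourself.

```python
import os
import tempfile

# Throwaway file containing the UTF-8 bytes for "ë".
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("ë".encode("utf-8"))
    path = tmp.name

try:
    # Binary mode plus an encoding is rejected with ValueError.
    try:
        open(path, "rb", encoding="utf-8")
        rejected = False
    except ValueError:
        rejected = True

    # Binary mode alone returns bytes; decoding is a separate, manual step.
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-8")
finally:
    os.remove(path)

print(rejected, raw, text)
```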
> f = open( "/var/www/cgi-data/index.html", "r", encoding="utf-8" )
That's the bunny!
If you just want to hide the problem without fixing the underlying cause,
add an argument errors="replace", which is ugly but at least lets you move
on:
py> b = "Hello ë ü world".encode('utf-8')
py> print(b.decode('ascii', errors='replace'))
Hello �� �� world
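The same argument can be passed straight to open(). A sketch, again with a
made-up temporary file, and with the ASCII codec forced so that the effect is
visible whatever the local default happens to be:

```python
import os
import tempfile

# Throwaway UTF-8 file standing in for the real index.html.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("Hello ë ü world".encode("utf-8"))
    path = tmp.name

try:
    # errors="replace" turns each undecodable byte into U+FFFD
    # instead of raising UnicodeDecodeError.
    with open(path, "r", encoding="ascii", errors="replace") as f:
        mangled = f.read()
finally:
    os.remove(path)

print(mangled)
```

The text is damaged, not repaired: each of the two accented letters becomes
two replacement characters, and the original bytes are gone from the result.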
--
Steven