Unicode in cgi-script with apache2

Sun Aug 17 06:05:21 EDT 2014

On Sunday, 17 August 2014 09:50:48 UTC+2, Steven D'Aprano wrote:
> py> b = "Hello ë ü world".encode('utf-8')
> 
> py> print(b.decode('ascii', errors='replace'))
> 
> Hello �� �� world

=========

No. You are approaching the problem the wrong way. This is
a typical situation where the resulting code works
correctly, but only as "works for me" code.

The mistake is that code written this way is not
suitable for the "system" that will host your
string.

In the present case, you are assuming, before
any string manipulation, that the output will be utf-8.

D:\>c:\python32\python
Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = "Hello ë ü world".encode('utf-8')
>>> b
b'Hello \xc3\xab \xc3\xbc world'
>>> b.decode('ascii', 'replace')
'Hello \ufffd\ufffd \ufffd\ufffd world'
>>> print(b.decode('ascii', 'replace'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6-7: character maps to <undefined>
>>>

The proper way is to "prepare" your string before any
further manipulation (see my previous comment about
processes).

I am explicitly using the code page cp850 and the
euro sign, which cp850 cannot encode.

>>> u = "Hello ë ü world \u20ac\u20ac\u20ac"
>>> newu = u.encode('cp850', 'replace').decode('cp850')
>>> print(newu)
Hello ë ü world ???
>>> type(newu)
<class 'str'>
>>>

The replacement character now belongs to the set of
characters that can be displayed.
Printing the string can no longer fail on this console.
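
As a minimal sketch, the round trip above can be wrapped in a small
helper. The name make_displayable is hypothetical (it is not part of
Python or of this thread), and the caller is assumed to know the
target encoding already.

```python
def make_displayable(text, encoding):
    """Round-trip text through `encoding`.

    Characters the encoding cannot represent are replaced by '?'
    (the behaviour of the 'replace' error handler when encoding),
    so the result contains only characters the encoding can show.
    """
    return text.encode(encoding, 'replace').decode(encoding)


u = "Hello \u00eb \u00fc world \u20ac\u20ac\u20ac"

# cp850 has e-diaeresis and u-diaeresis but no euro sign.
safe = make_displayable(u, 'cp850')

# utf-8 can represent everything, so the string comes back unchanged.
same = make_displayable(u, 'utf-8')
```

The result is still a str, so it can be passed on to any further
string handling before it is finally printed.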

You can reproduce the same behaviour with a web browser.

Create an html file in utf-8 containing characters
that do not belong to iso-8859-1.
Display that file and change the browser's encoding
to iso-8859-1.
You will see that the browser "re-encodes" the source with
a replacement character and only then redisplays it. It is the
same process I gave above.

The key point is detecting, where possible, the coding scheme
that should be used.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
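
A possible way to combine the detection with the preparation step is
sketched below. The names stdout_encoding and safe_print are
hypothetical, and falling back to ascii when no encoding can be
detected is an assumption of this sketch, not something the
interpreter prescribes.

```python
import sys


def stdout_encoding(default='ascii'):
    """Return the encoding of sys.stdout, or `default` if unknown.

    sys.stdout may have been replaced by an object without an
    `encoding` attribute, or the attribute may be None.
    """
    return getattr(sys.stdout, 'encoding', None) or default


def safe_print(text):
    """Print text after preparing it for the detected encoding."""
    enc = stdout_encoding()
    # Unencodable characters become '?', so print() cannot raise
    # UnicodeEncodeError for this encoding.
    print(text.encode(enc, 'replace').decode(enc))
```

With this, safe_print("Hello ë ü world \u20ac") shows question marks
in place of whatever the console cannot display, instead of raising.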

My example is not Windows-specific. On a gb**** Chinese
BSD or a KOI8 Russian Linux, the problem is identical.

jmf