a question about Chinese characters in a Python Program

Ben Finney bignose+hates-spam at benfinney.id.au
Mon Oct 20 22:27:47 CEST 2008


est <electronixtar at gmail.com> writes:

> IMHO it's even better to output wrong encodings rather than halt the
> WHOLE damn program by an exception

I can't agree with this. The correct thing to do in the face of
ambiguity is for Python to refuse to guess.
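
A minimal sketch of what "refuse to guess" looks like in practice, written in Python 3 syntax (an assumption; the helper name `decode_strict` is hypothetical). Bytes that are not valid in the declared encoding raise an error instead of being silently mapped to something:

```python
def decode_strict(raw, encoding):
    """Decode raw bytes, letting any mismatch surface as an exception."""
    # errors="strict" is already the default; it is spelled out for emphasis.
    return raw.decode(encoding, errors="strict")


utf8_bytes = "\u4e2d\u6587".encode("utf-8")  # the text "中文" as UTF-8 bytes

# The declared encoding matches the data, so decoding succeeds.
print(ascii(decode_strict(utf8_bytes, "utf-8")))

# The declared encoding cannot account for these bytes: no guess is made.
try:
    decode_strict(utf8_bytes, "ascii")
except UnicodeDecodeError as exc:
    print("refused:", exc.reason)
```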

> When debugging encoding problems, the solution is simple. If
> characters display wrong, switch to another encoding, one of them
> must be right.

That's debugging problems not in the program but in the *data*, and
Python helps with that by making the problems apparent as early as
is feasible.
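
To illustrate the point, a small sketch in Python 3 syntax (the sample text and the list of candidate encodings are illustrative assumptions). The same bytes decode without complaint under more than one encoding, so trying encodings until the output looks right gives no signal from the program itself; only the strict failure points at the data:

```python
raw = "\u4e2d\u6587".encode("gbk")  # the text "中文" encoded as GBK

for encoding in ("gbk", "latin-1", "utf-8"):
    try:
        text = raw.decode(encoding)
        # ascii() keeps the output printable on any terminal.
        print(encoding, "->", ascii(text))
    except UnicodeDecodeError as exc:
        # A loud failure: the data does not fit the declared encoding.
        print(encoding, "-> error:", exc.reason)
```

Here `latin-1` accepts the bytes and produces plausible-looking but wrong text, which is exactly the kind of silent corruption an exception prevents.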

> But it's tiring in python to deal with encodings, you have to wrap
> EVERY SINGLE character expression with try ... except ... just imagine
> what pain it is.

That sounds like a rather poor program design. Much better to sanitise
the inputs to the program at a few well-defined points, and know from
that point that the program is dealing internally with Unicode.
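
One way such a design could look, as a sketch only (Python 3 syntax; the function names `read_input`, `shout` and `write_output` are hypothetical). Decoding happens once on the way in, encoding once on the way out, and nothing in between needs a try ... except around character handling:

```python
def read_input(raw, encoding):
    """Boundary in: bytes become text exactly once, in a declared encoding."""
    return raw.decode(encoding)


def shout(text):
    """Internal logic: only ever sees text, so no encoding handling here."""
    return text.upper()


def write_output(text, encoding):
    """Boundary out: text becomes bytes exactly once."""
    return text.encode(encoding)


incoming = "d\u00e9j\u00e0 vu".encode("utf-8")  # stand-in for bytes from a file or socket

try:
    text = read_input(incoming, "utf-8")  # the single place decoding can fail
except UnicodeDecodeError:
    raise SystemExit("input is not valid in the declared encoding")

outgoing = write_output(shout(text), "utf-8")
print(outgoing)
```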

> Dealing with character encodings is really simple.

Given that your solutions are baroque and complicated, I don't think
even you yourself can believe that statement.

> Like I said, str() should NOT throw an exception BY DESIGN, it's a
> basic language standard.

Any code should throw an exception if the input is both ambiguous and
invalid by the documented specification.

> str() is not only a convert to string function, but also a
> serialization in most cases.(e.g. socket) My simple suggestion is:
> If it's a unicode character, output as UTF-8; otherwise just output
> byte array, please do not encode it with really stupid range(128)
> ASCII. It's not guessing, it's totally wrong.

Your assumption would require that UTF-8 be a lowest *common*
denominator for most output devices Python will be connected to.
That's simply not the case; the lowest common denominator is still
ASCII.

I yearn for a future where all output devices can be assumed, in the
absence of other information, to understand a common Unicode encoding
(e.g. UTF-8), but we're not there yet and it would be a grave mistake
for Python to falsely behave as though we were.
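
For a program that does have to write to a device of unknown capability, a hedged sketch in Python 3 syntax (the helper name `encode_for_device` and the choice of the `backslashreplace` error handler are assumptions for illustration, not a recommendation from this thread). The encoding is stated explicitly by the programmer rather than guessed, and ASCII is the conservative default:

```python
def encode_for_device(text, encoding="ascii"):
    """Encode text for an output device whose encoding is stated explicitly.

    Characters the encoding cannot represent become backslash escapes,
    so the result stays within what the device is known to accept.
    """
    return text.encode(encoding, errors="backslashreplace")


title = "D\u00e9j\u00e0 Vu"

print(encode_for_device(title))           # for an ASCII-only device
print(encode_for_device(title, "utf-8"))  # for a device known to accept UTF-8
```

The escaping is a deliberate, visible choice made at the output boundary; it does not require Python itself to assume anything about the device.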

-- 
 \     “I went to a fancy French restaurant called ‘Déjà Vu’. The head |
  `\                  waiter said, ‘Don't I know you?’” —Steven Wright |
_o__)                                                                  |
Ben Finney


