[issue20906] Issues in Unicode HOWTO

Thu Mar 20 12:32:52 CET 2014

Marc-Andre Lemburg added the comment:

On 20.03.2014 11:49, Graham Wideman wrote:
> 
>> An encoding is a mapping of characters to ordinals, nothing more or less.
> 
> In Unicode, the mapping from characters to ordinals (code points) is not the encoding. It's the mapping from code points to bytes that's the encoding. While I wish this was a distinction reserved for pedants, unfortunately it's an aspect that's important for users of Unicode to understand in order to make sense of how it works, and what the literature and the web say (correct and otherwise).

I know that Unicode terminology provides all kinds of ways to name
things, and we can be arbitrarily pedantic about any of them. The fact
that the Unicode Consortium changes its mind every few years isn't
helpful either :-)

We could just as well have called encodings "character sets", "code pages",
"character encodings", "transformations", etc.

In Python, we keep it simple: you have Unicode (code points) and 8-bit strings
or bytes (code units).
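
To make that split concrete, here is a minimal sketch, assuming Python 3
(where str holds code points and bytes holds the encoded form). The sample
string and the variable names are only illustrative:

```python
# A str is a sequence of Unicode code points.
text = "na\u00efve \u20ac"        # 'naïve €'
code_points = [ord(ch) for ch in text]

# Encoding turns the code points into 8-bit code units (bytes).
data = text.encode("utf-8")
byte_values = list(data)

# The two views have different lengths: non-ASCII code points
# occupy more than one byte in UTF-8.
print(len(text), [hex(cp) for cp in code_points])
print(len(data), [hex(b) for b in byte_values])
```

The str has one entry per code point, while the bytes object has one entry
per 8-bit unit, so the two lengths only agree for pure ASCII text.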

Whenever you go from Unicode to bytes, you encode Unicode into some encoding.
Going back, you decode the encoding back into Unicode. Both directions are
defined by the codec implementing the encoding, and the round trip is *not*
guaranteed to be lossless.
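
As a rough illustration of the lossless and lossy cases, here is a small
sketch, again assuming Python 3. The choice of the "utf-8" and "ascii"
codecs and of the "replace" error handler is just for the example:

```python
text = "caf\u00e9 \u20ac5"         # 'café €5'

# UTF-8 can represent every code point, so this round trip is lossless.
utf8 = text.encode("utf-8")
assert utf8.decode("utf-8") == text

# ASCII cannot represent 'é' or '€'. With the default strict error
# handling, the codec refuses to encode at all.
strict_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    strict_error = exc

# With errors="replace", the codec substitutes '?' for each character
# it cannot represent, so the original text cannot be recovered.
lossy = text.encode("ascii", errors="replace")
round_tripped = lossy.decode("ascii")
print(lossy, round_tripped == text)
```

Whether information is lost depends on both the codec and the error handler
you pick, which is why the round trip cannot be assumed to be safe in general.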

See here for how we ended up having Unicode support in Python:

http://www.egenix.com/library/presentations/#PythonAndUnicode

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________