[docs] [issue20906] Issues in Unicode HOWTO
report at bugs.python.org
Thu Mar 20 08:47:56 CET 2014
Marc-Andre Lemburg added the comment:
Just to clarify a few things:
On 20.03.2014 00:50, Graham Wideman wrote:
> I think part of the ambiguity problem here is that there are two subtly but importantly different ideas here:
> 1. Python string (capable of representing any unicode text) --> some full-fidelity and industry recognized unicode byte stream, like utf-8, or utf-32. I think this is legitimately described as an "encoding" of the unicode string.
Right, those are Unicode transformation format (UTF) encodings which are
capable of representing all Unicode code points.
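
As a rough illustration of that point, the sketch below round-trips one string through three UTF codecs. The helper name utf_sizes and the sample string are made up for this example; the byte counts for "utf-16" and "utf-32" include the byte order mark those codecs prepend.

```python
def utf_sizes(text):
    """Return {codec: encoded length in bytes} after checking a lossless round trip."""
    sizes = {}
    for codec in ("utf-8", "utf-16", "utf-32"):
        data = text.encode(codec)
        # Every UTF codec can represent every code point, so decoding restores the text.
        assert data.decode(codec) == text
        sizes[codec] = len(data)
    return sizes

# One ASCII letter, one Latin-1 letter, one BMP symbol, one non-BMP code point.
sample = "A\u00e9\u20ac\U0001F600"
print(utf_sizes(sample))
```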
> 2. Python string --> some other code system, such as ASCII, cp1250, etc. The destination code system doesn't necessarily have anything to do with unicode, and whole ranges of unicode's characters either result in an exception, or get translated as escape sequences. Ie: This is more usefully seen as a translation operation, than "merely" encoding.
Those are encodings as well. The operation going from Unicode to one of
these encodings is called "encode" in Python. The other way around is
called "decode".
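
A minimal sketch of str.encode() and bytes.decode() with two non-UTF codecs. The sample word is an arbitrary choice; it contains U+0142, which cp1250 can represent and ASCII cannot, so the ASCII case shows both the exception and the escape-sequence behaviour mentioned in the quoted text.

```python
text = "z\u0142oty"  # U+0142 exists in cp1250 but not in ASCII

# cp1250 covers this character, so encode/decode round-trips cleanly.
legacy = text.encode("cp1250")
assert legacy.decode("cp1250") == text

# With the default errors="strict", ASCII raises for the unmappable character.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# An explicit error handler replaces it with an escape sequence instead.
escaped = text.encode("ascii", errors="backslashreplace")
print(legacy, ascii_failed, escaped)
```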
> In 1, the encoding process results in data that stays within concepts defined within Unicode. In 2, encoding produces data that would be described by some code system outside of Unicode.
> At the moment I think Python muddles these two ideas together, and I'm not sure how to clarify this.
An encoding is a mapping of characters to ordinals, nothing more or
less. Unicode is such an encoding, but all others are as well. They
just happen to have different ranges of ordinals.
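
To make the "characters to ordinals" view concrete, here is a small sketch that looks up the ordinal a few single-byte codecs assign to the same character. The helper ordinal_in is hypothetical and only works for codecs that map the character to exactly one byte.

```python
def ordinal_in(char, codec):
    """Ordinal that a single-byte codec assigns to one character."""
    (value,) = char.encode(codec)  # unpacking fails if the result is not exactly one byte
    return value

euro = "\u20ac"
e_acute = "\u00e9"

print(hex(ord(euro)))                     # ordinal in Unicode
print(hex(ordinal_in(euro, "cp1252")))    # ordinal in cp1252
print(hex(ord(e_acute)))                  # ordinal in Unicode
print(hex(ordinal_in(e_acute, "cp437")))  # ordinal in cp437
```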
You are viewing all this from a Unicode point of view, but please
realize that Unicode is rather new in the business and the many
other encodings Python supports have been around for decades.
>> So it should say "16-bit code points" instead, right?
> I don't think Unicode code points should ever be described as having a particular number of bits. I think this is a core concept: Unicode separates the character <--> code point, and code point <--> bits/bytes mappings.
> At most, one might want to distinguish different ranges of unicode code points. Even if there is a need to distinguish code points <= 65535, I don't think this should be described as "16-bit", as it muddies the distinction between Unicode's two mappings.
You have UCS-2 and UCS-4. UCS-2 is representable in 16 bits; UCS-4
needs 21 bits, but is typically stored in 32 bits. Still,
you're right: it's better to use the correct terms UCS-2 vs. UCS-4
rather than refer to the number of bits.
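
The bit counts can be checked with a short sketch. It assumes Python 3.3 or later, where sys.maxunicode is always 0x10FFFF; the helper fits_ucs2 is a made-up name for this example.

```python
import sys

# The largest UCS-2 value needs 16 bits; the largest code point needs 21.
ucs2_bits = (0xFFFF).bit_length()
max_bits = sys.maxunicode.bit_length()  # assumes Python 3.3+, where this is 0x10FFFF

def fits_ucs2(text):
    """True if every code point in text is within the UCS-2 range."""
    return all(ord(ch) <= 0xFFFF for ch in text)

astral = "\U0001F600"  # a code point outside the UCS-2 range

# Stored as UTF-32 without a byte order mark, it still occupies a full 32 bits.
stored = astral.encode("utf-32-le")
print(ucs2_bits, max_bits, fits_ucs2(astral), len(stored))
```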
Python tracker <report at bugs.python.org>