[issue20906] Issues in Unicode HOWTO

Graham Wideman report at bugs.python.org
Thu Mar 20 00:50:41 CET 2014


Graham Wideman added the comment:

Antoine:

Thanks for your comments -- this is slippery stuff.

> It's better, but how about simply "In this article"?

I was hoping to inform the reader that the hex representations are found in many articles, not just special to this one.

> [ showing the glyph ]

Agreed -- it would be good to show the glyphs mentioned. But in a way that isn't confusing if the user's web browser doesn't show it correctly.

> For all intents and purposes, iso-8859-1 and friends *are* encodings 
> (and this is how Python actually names them).

I am still mulling this over. iso-8859-1 is most literally an "encoding" in the old sense of the word (character <--> byte representation), and is not, per se, a unicode-related concept. 

I think part of the ambiguity problem here is that there are two subtly but importantly different ideas here:

1. Python string (capable of representing any unicode text) --> some full-fidelity and industry recognized unicode byte stream, like utf-8, or utf-32. I think this is legitimately described as an "encoding" of the unicode string.

versus:

2. 1. Python string --> some other code system, such as ASCII, cp1250, etc. The destination code system doesn't necessarily have anything to do with unicode, and whole ranges of unicode's characters either result in an exception, or get translated as escape sequences. Ie: This is more usefully seen as a translation operation, than "merely" encoding.

In 1, the encoding process results in data that stays within concepts defined within Unicode. In 2, encoding produces data that would be described by some code system outside of Unicode.

At the moment I think Python muddles these two ideas together, and I'm not sure how to clarify this. 

> So it should say "16-bit code points" instead, right?

I don't think Unicode code points should ever be described as having a particular number of bits. I think this is a core concept: Unicode separates the character <--> code point, and code point <--> bits/bytes mappings. 

At most, one might want to distinguish different ranges of unicode code points. Even if there is a need to distinguish code points <= 65535, I don't think this should be described as "16-bit", as it muddies the distinction between Unicode's two mappings.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________


More information about the Python-bugs-list mailing list