[docs] [issue20906] Issues in Unicode HOWTO

Graham Wideman report at bugs.python.org
Fri Mar 21 03:30:19 CET 2014


Graham Wideman added the comment:

Marc-Andre: Thanks for your latest comments. 

> We could also have called encodings: "character set", "code page",
> "character encoding", "transformation", etc.

I concur with you that things _could_ be called all sorts of names, and the choices may be arbitrary. However, creating a clear explanation requires figuring out the distinct things of interest in the domain, picking terms for those things that are distinct, and then using those terms rigorously. (Usage in the field may vary, which in itself may warrant comment.)

I read your slide deck/time-capsule-from-2002,  with interest, on a number of points. (I realize that you were involved in the Python 2.x implementation of Unicode. Not sure about 3.x?)

Page 8 "What is a Character?" is lovely, showing very explicitly Unicode's two levels of mapping, and giving names to the separate parts. It strongly suggests this HOWTO page needs a similar figure.

That said, there are a few notes to make on that slide, useful in trying to arrive at consistent terms: 

1. The figure shows a more precise word for "what users regard as a character", namely "grapheme". I'd forgotten that.

2. It shows e-accent-acute to demonstrate a pair of code points representing a single grapheme. That's important, but should avoid suggesting this as the only way to form e-accent-acute (canonical equivalence, U+00E9).

3. The illustration identifies the series of code points (the middle row) as "the Unicode encoding of the string". Ie: The grapheme-to-code-points mapping is described as an encoding. Not a wrong use of general language. But inconsistent with the mapping that encode() pertains to. (And I don't think that the code-point-to-grapheme transform is ever called "decoding", but I could be wrong.)

4. The illustration of Code Units (in the third row) shows graphemes for the Code Units (byte values). This confusingly glosses over the fact that those graphemes correspond to what you would see if you _decoded_ these byte values using CP1252 or ISO 8859-1, suggesting that the result is reasonable or useful. It certainly happens that people do this, deliberately or accidentally, but it is a misuse of the data, and should be warned against, or at least explained as a confusion.

Returning to your most recent message:

> In Python keep it simple: you have Unicode (code points) and 
> 8-bit strings or bytes (code units).

I wish it _were_ that simple. And I agree that, in principle, (assuming Python 3+) there should "inside your program" where you have the str type which always acts as sequences of Unicode code points, and has string functions. And then there's "outside your program", where text is represented by sequences of bytes that specify or imply some encoding. And your program should use supplied library functions to mostly automatically convert on the way in and on the way out.

But there are enough situations where the Python programmer, having adopted Python 3's string = Unicode approach, sees unexpected results. That prompts reading this page, which is called upon to make the fine distinctions to allow figuring out what's going on.

I'm not sure what you mean by "8-bit strings" but I'm pretty sure that's not an available type in Python 3+. Ie: Some functions (eg: encode()) produce sequences of bytes, but those don't work entirely like strs. 

-----------
This discussion to try to revise the article piecemeal has become pretty diffuse, with perhaps competing notions of purpose, and what level of detail and precision are needed etc. I will try to suggest something productive in a subsequent message.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________


More information about the docs mailing list