[New-bugs-announce] [issue20906] Unicode HOWTO

Thu Mar 13 10:16:27 CET 2014

New submission from Graham Wideman:

The Unicode HOWTO article is an attempt to help users wrap their minds around Unicode. There are some opportunities for improvement. Issues presented in order of the narrative:

http://docs.python.org/3.3/howto/unicode.html

History of Character Codes
---------------------------

References to the 1980's are a bit off.

"In the mid-1980s an Apple II BASIC program..." 

Assuming the comment is about the state of play in the mid-80s: the Apple II appeared in 1977. By 1985 we already had Macs, and PCs running DOS, which were capable of various character sets (not to mention lowercase letters!).

"In the 1980s, almost all personal computers were 8-bit"

Both the PC (1981) and Mac (1984) had 16-bit processors.

Definitions:
------------
"Characters are abstractions": Not helpful unless one already knows what "abstraction" means in this specific context.

"the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet [...] but these are two different characters that have different meanings."

Omega is a poor example for this concept. Omega is used as the identifier for a unit in the same way as "m" is used for meter, or "A" is used for ampere. Each is a specific use of a character, which, like any specific use, has a particular meaning. However, having a particular meaning doesn't necessarily require a separate character, and in the case of omega, the Unicode standard now says that the separate "ohm" character is deprecated. 

"The ohm sign is canonically equivalent to the capital omega, and normalization would remove any distinction."

http://www.unicode.org/versions/Unicode4.0.0/ch07.pdf#search=%22character%20U%2B2126%20maps%20OR%20map%20OR%20mapping%22

A better example might be the roman numerals, code points U+2160 and subsequent.
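
The normalization behaviour is easy to check from Python. The following is only a minimal sketch using the standard unicodedata module; the variable names are mine and purely illustrative:

```python
import unicodedata

ohm_sign = "\u2126"       # OHM SIGN
capital_omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

# The two are distinct code points...
distinct = ohm_sign != capital_omega

# ...but canonical normalization (NFC) folds the ohm sign into omega.
ohm_nfc = unicodedata.normalize("NFC", ohm_sign)

# Roman numeral one (U+2160) survives NFC unchanged; only compatibility
# normalization (NFKC) maps it to the Latin letter "I".
roman_one = "\u2160"
roman_nfc = unicodedata.normalize("NFC", roman_one)
roman_nfkc = unicodedata.normalize("NFKC", roman_one)

print(distinct, ohm_nfc == capital_omega)         # True True
print(roman_nfc == roman_one, roman_nfkc == "I")  # True True
```

So the ohm sign does not stay distinct under canonical normalization, whereas the roman numeral does.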

Definitions
------------

"A code point is an integer value, usually denoted in base 16."  

When trying to convey clearly the distinction between character, code point, and byte representation, the topic of "how it's denoted" is a potential distraction for the reader, so I suggest this point be made more explicitly parenthetical, and less confusable with "16 bit". For example:

"A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, with some 110 thousand assigned so far). In a narrative such as the current article, a code point value is usually written in hexadecimal. The Unicode standard displays code points with the notation U+265E to mean the character with value 0x265E (9822 decimal; "Black Chess Knight" character)."

(Also revise the subsequent para to use the same example character. I suggest not using "Ethiopic Syllable WI", because it's unfamiliar to most readers, and it muddies the topic by suggesting that Unicode in general captures _syllables_ rather than _characters_.)
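
A short interpreter-style example could accompany the suggested wording. This is a sketch only, and it assumes Python 3.3 or later (where sys.maxunicode is always 0x10FFFF); the names knight and code_point are illustrative:

```python
import sys
import unicodedata

knight = "\u265e"

# ord() gives the code point as a plain integer; hex and U+ forms are
# just different ways of writing that same integer.
code_point = ord(knight)
print(code_point)                # 9822
print(hex(code_point))           # 0x265e
print("U+%04X" % code_point)     # U+265E
print(unicodedata.name(knight))  # BLACK CHESS KNIGHT

# Largest code point (assumes Python 3.3+).
print(hex(sys.maxunicode))       # 0x10ffff
```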

Encodings:
-----------
"This sequence needs to be represented as a set of bytes"
--> "This code point sequence needs to be represented as a sequence of bytes"

"4. Many Internet standards are defined in terms of textual data"

This is a vague claim. Probably what was intended was: "Many Internet standards define protocols in which the data must contain no zero bytes, or zero bytes have special meaning."  Is this actually true? Are there "many" such standards?
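
Whatever the standards say, the zero-byte property itself is easy to demonstrate. A minimal sketch, using "utf-32-le" as a stand-in for the 32-bit representation the HOWTO describes (that pairing is my assumption) and UTF-8 for comparison:

```python
text = "Python"

# Four bytes per code point, little-endian, no byte order mark.
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

print(len(utf32), len(utf8))              # 24 6
print(b"\x00" in utf32, b"\x00" in utf8)  # True False
```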

"Generally people don’t use this encoding,"
Probably "people" per se don't use any encoding, computers do.  --> "Because of these problems, other more efficient and convenient encodings have been devised and are commonly used."

For continuity, directly after that para should come the later paras starting with "UTF-8 is one of the most common".

"2. A Unicode string is turned into a string of bytes..."
--> "2. A Unicode string is turned into a sequence of bytes..."  (I.e., don't overload "string" in an article about strings and encodings.)
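
The str versus bytes distinction is something the article could show directly. A minimal sketch, assuming Python 3; the sample text and variable names are illustrative:

```python
text = "\u265e takes pawn"

# Encoding turns a str (sequence of code points) into bytes
# (sequence of integers 0-255).
data = text.encode("utf-8")

print(type(text).__name__, type(data).__name__)  # str bytes
print(len(text), len(data))                      # 12 14
print(list(data[:3]))                            # [226, 153, 158]
print(data.decode("utf-8") == text)              # True
```

The one non-ASCII character accounts for three of the fourteen bytes, which is why the two lengths differ.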

Create a new subhead "Converting from Unicode to non-Unicode encodings", and move under it the paras:

"Encodings don't have to..."
"Latin-1, also known as..."
"Encodings don't have to..."

But also revise:

"Encodings don’t have to handle every possible Unicode character, and most encodings don’t."

--> "Non-Unicode code systems usually don't handle all of the characters to be found in Unicode."
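
An example along these lines might make the point concrete. It is a sketch only, using Latin-1 as the non-Unicode code system; the variable names are illustrative:

```python
knight = "\u265e"   # BLACK CHESS KNIGHT, not representable in Latin-1
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, representable

print(e_acute.encode("latin-1"))  # b'\xe9'

# Encoding a character outside the target repertoire raises by default.
try:
    knight.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# An error handler can substitute a placeholder instead of raising.
replaced = knight.encode("latin-1", errors="replace")
print(replaced)  # b'?'
```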

----------
assignee: docs at python
components: Documentation
messages: 213367
nosy: docs at python, gwideman
priority: normal
severity: normal
status: open
title: Unicode HOWTO
type: enhancement
versions: Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 3.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________