[docs] [issue20906] Issues in Unicode HOWTO

Graham Wideman report at bugs.python.org
Mon Mar 17 01:17:23 CET 2014


Graham Wideman added the comment:

> Do you want to provide a patch?

I would be happy to, but I'm not currently set up to create a patch. Also, I had hoped that an author with more history with this article would supervise, especially where I don't know what the original intent was.

> I find use of the word "narrative" intimidating in the context of a technical documentation.

Agreed. How about "In documentation such as the current article..."

> In general, I find it disappointing that the Unicode HOWTO only gives 
> hexadecimal representations of non-ASCII characters and (almost) never 
> represents them in their true form. This makes things more abstract 
> than necessary.

I concur with reducing unnecessary abstraction. Not sure what you mean by "true form". Do you mean show the glyph which the code point represents? Or the sequence of bytes? Or display the code point value in decimal?

> > This is a vague claim. Probably what was intended was: "Many 
> > Internet standards define protocols in which the data must 
> > contain no zero bytes, or zero bytes have special meaning."  
> > Is this actually true? Are there "many" such standards?

> I think it actually means that Internet protocols assume an ASCII-compatible 
> encoding (which UTF-8 is, but not UTF-16 or UTF-32 - nor EBCDIC :-)).

Ah -- yes that makes sense.
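To make that concrete (a rough Python 3 illustration of my own, not text from the HOWTO): encoding ASCII text as UTF-16 or UTF-32 introduces zero bytes, whereas UTF-8 does not:

    s = "GET /index.html"
    print(b"\x00" in s.encode("utf-8"))      # False -- no zero bytes, ASCII-compatible
    print(b"\x00" in s.encode("utf-16-le"))  # True  -- each ASCII char gains a zero high byte
    print(b"\x00" in s.encode("utf-32-le"))  # True  -- three zero bytes per ASCII char

So a protocol that treats a zero byte as a terminator, or otherwise as special, can carry UTF-8 unchanged but not UTF-16 or UTF-32.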

> > --> "Non-Unicode code systems usually don't handle all of 
> > the characters to be found in Unicode."

> The term *encoding* is used pervasively when dealing with the transformation 
> of unicode to/from bytes, so I find it confusing to introduce another term here 
> ("code systems"). I prefer the original sentence.

I see that my revision missed the target. There is a problem, but it is wider than this sentence.

One of the most essential points this article should make clear is the distinction between older schemes with a single mapping:

Characters <--> numbers in a particular binary format (e.g., ASCII)

... versus Unicode with two levels of mapping...

Characters <--> code point numbers <--> particular binary format of the number data and sequences thereof.

In the older schemes, "encoding" referred to the one mapping: chars <--> numbers in particular binary format. In Unicode, "encoding" refers only to the mapping: code point numbers <--> binary format. It does not refer to the chars <--> code point mapping. (At least, I think that's the case. Regardless, the two mappings need to be rigorously distinguished.)
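To illustrate the two mappings in Python 3 terms (my own sketch, not wording from the article):

    # Mapping 1: character <--> code point (no bytes involved yet)
    ch = "é"
    cp = ord(ch)                 # 233, i.e. U+00E9
    assert chr(cp) == ch

    # Mapping 2: code point(s) <--> a particular binary format ("encoding")
    utf8_bytes = ch.encode("utf-8")       # b'\xc3\xa9'
    utf16_bytes = ch.encode("utf-16-le")  # b'\xe9\x00'
    assert utf8_bytes.decode("utf-8") == ch

ord()/chr() operate purely at the character <--> code point level; only encode()/decode() bring a particular byte layout into the picture.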

On review, there are many points in the article that muddy this up.  For example, "Unicode started out using 16-bit characters instead of 8-bit characters". Saying "so-and-so-bit characters" about Unicode, in the current article, is either wrong or very confusing.  Unicode characters are associated with code points, NOT with any _particular_ bit-level representation.
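For instance (again my own example): the code point U+20AC, EURO SIGN, is one character regardless of encoding, but the number of bytes used to represent it depends entirely on the encoding chosen:

    euro = "\u20ac"
    print(len(euro))                      # 1 character / code point
    print(len(euro.encode("utf-8")))      # 3 bytes
    print(len(euro.encode("utf-16-le")))  # 2 bytes
    print(len(euro.encode("utf-32-le")))  # 4 bytes

So it is the UTF-16 *encoding* that is 16-bit (per code unit), not the characters themselves.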

If I'm right about the preceding, then it would be good for that to be spelled out more explicitly, and used consistently throughout the article. (I won't try to list all the examples of this problem here -- too messy.)

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________

