[issue20906] Issues in Unicode HOWTO

Graham Wideman report at bugs.python.org
Sat Mar 22 11:55:23 CET 2014


Graham Wideman added the comment:

@Andre: 

_I_ know more or less the explanations behind all this. I am just putting it forward as an example that touches several concepts needed to explain it, concepts that a programmer might reason with to change a program (or the environment) so that it produces some output (instead of an exception), and possibly even the intended output.

For example, behind the brief explanation you provide, here are some of the related concepts:

1. print(s) sends output to stdout, which sends data to the Windows console (cmd.exe).

2. In the process, the output function that print --> stdout invokes attempts to encode s using the encoding that the destination, cmd.exe, reports that it expects.

3. On Windows (in English, or perhaps it's the US locale), cmd.exe defaults to expecting encoding cp437.

4. cp437 is an encoding containing only 256 characters. Many Unicode code points obviously have no corresponding character in cp437.

5. The encoding process used by print() is set to raise an exception on characters that have no mapping in the encoding wanted by stdout.

6. Consequently, print() raises an exception (UnicodeEncodeError) on code points outside of those representable in cp437.
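
The six steps above can be reproduced without a Windows console. The following is only a sketch: print_to_console is a hypothetical helper name (not part of Python), and an in-memory stream stands in for cmd.exe, on the assumption that stdout behaves like a text wrapper with encoding cp437 and strict error handling.

```python
import io


def print_to_console(text, encoding="cp437", errors="strict"):
    """Hypothetical helper: mimic print() writing to a console stream.

    An in-memory byte buffer stands in for cmd.exe; the text wrapper
    plays the role of sys.stdout with the console's encoding.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


# U+00E9 (e with acute accent) has a cp437 mapping, so this succeeds.
ok = print_to_console("caf\u00e9")

# U+20AC (euro sign) has no cp437 mapping, so strict encoding raises.
failure = None
try:
    print_to_console("\u20ac5")
except UnicodeEncodeError as exc:
    failure = exc

# With a lenient error handler the same text is written, lossily.
lossy = print_to_console("\u20ac5", errors="replace")
```

Here ok holds the cp437 bytes for the text, failure holds the exception described in step 6, and lossy shows what a non-strict error handler would have sent instead.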

Based on that, there are a number of moves the programmer might make, with varying results... possibly involving:

-- s.encode([various choices of options here]) --> s_as_bytes
-- print(s_as_bytes) (noting that 'Hello ' + s_as_bytes doesn't work)
-- Or maybe ascii(s)
-- Or possibly sys.stdout.buffer.write()

-- Pros and cons of the above, which require careful tracking of what the resulting strings or byte sequences "really mean" at each juncture.
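
A rough sketch of those moves, to show what each result "really means". The sample string, the variable names, and the write_raw helper are illustrative assumptions; the in-memory buffer stands in for sys.stdout.buffer so the example does not depend on a real console.

```python
import io

s = "price: \u20ac5"  # contains U+20AC, which cp437 cannot represent

# s.encode(...) with different error handlers gives different bytes.
replaced = s.encode("cp437", errors="replace")
escaped = s.encode("cp437", errors="backslashreplace")
charref = s.encode("cp437", errors="xmlcharrefreplace")

# print(s_as_bytes) shows the repr of the bytes object, not decoded text.
shown = str(replaced)

# 'Hello ' + s_as_bytes doesn't work: str and bytes do not mix.
concat_failed = False
try:
    "Hello " + replaced
except TypeError:
    concat_failed = True

# ascii(s) stays a str, with non-ASCII characters escaped and quotes added.
safe = ascii(s)


def write_raw(binary_stream, data):
    """Hypothetical helper: write already-encoded bytes, bypassing text encoding."""
    binary_stream.write(data)


# In a real program the stream would be sys.stdout.buffer.
fake_stdout_buffer = io.BytesIO()
write_raw(fake_stdout_buffer, replaced + b"\n")
```

Note that replaced, escaped and charref are bytes, while shown and safe are str, which is exactly the kind of bookkeeping the previous point refers to.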

-- In cmd.exe, chcp 65001 --> print(unicode) no longer raises an exception, but many chars will still show as [?]
-- Various font choices in cmd.exe which might be able to show the needed glyphs.
-- Automatic font substitution that occurs in some contexts when the selected font doesn't contain a glyph for a requested code point.

... and probably more concepts that I've missed.

-- Graham

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________

