[Python-Dev] len(chr(i)) = 2?

Terry Reedy tjreedy at udel.edu
Tue Nov 23 23:44:07 CET 2010


On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:

> This discussion motivated me to start looking into how well Python
> library itself is prepared to deal with len(chr(i)) = 2.  I was not

Good idea!

> surprised to find that textwrap does not handle the issue that well:
>
>>>> len(wrap(' \U00010140' * 80, 20))
> 12
>>>> len(wrap(' \U00000140' * 80, 20))
> 8

How well does textwrap handle composable pairs (letter + accent)? Does 
it count two codepoints as one char space, and avoid putting line breaks 
between them? I suspect textwrap should be regarded as 
(extended?)_ascii_textwrap.
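
A quick way to check the first question, as a rough sketch: compare a
precomposed letter with its letter + combining accent spelling. The
helper name count_lines is made up for this illustration, and the
comparison assumes textwrap measures width with len(), i.e. in code
points (or code units on a narrow build).

```python
import textwrap
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def count_lines(word, width=20, repeat=80):
    """Hypothetical helper: wrap `repeat` copies of `word` and count lines."""
    return len(textwrap.wrap((" " + word) * repeat, width))

# Both spellings render as one character cell, but their lengths differ.
print(len(composed), len(decomposed))

# NFC normalization folds this particular pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)

# If textwrap counted printed cells, these two numbers would be equal.
print(count_lines(composed), count_lines(decomposed))
```

Normalizing to NFC first only helps for pairs that have a precomposed
form; it does not answer the general question.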
>
> That module should probably be rewritten to properly implement  the
> Unicode line breaking algorithm
> <http://unicode.org/reports/tr14/tr14-22.html>.

Probably a good idea.

> Yet finding a bug in a str object method after a 5 min review was a
> bit discouraging:
>
>>>> 'xyz'.center(20, '\U00010140')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long

Again, what does it do with letter + decorator combinations? It seems to 
me that the whole notion that one code point == one printed character 
space is broken once one leaves ASCII. Perhaps we need an is_uchar 
function to recognize multi-code sequences, including surrogate pairs, 
that represent one char for the purpose of character-oriented functions.
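
To make the is_uchar idea concrete, here is a minimal sketch. No such
function exists in the stdlib; split_uchars and the two surrogate
helpers are names invented for this illustration. It only groups a base
code point with following combining marks (nonzero
unicodedata.combining) and a high surrogate with a following low
surrogate. It is not the full grapheme cluster algorithm of Unicode
Standard Annex #29.

```python
import unicodedata

def is_high_surrogate(ch):
    # Lead half of a UTF-16 surrogate pair.
    return 0xD800 <= ord(ch) <= 0xDBFF

def is_low_surrogate(ch):
    # Trail half of a UTF-16 surrogate pair.
    return 0xDC00 <= ord(ch) <= 0xDFFF

def split_uchars(s):
    """Hypothetical helper: split s into user-perceived character groups."""
    clusters = []
    for ch in s:
        attaches = clusters and (
            unicodedata.combining(ch) != 0
            or (is_low_surrogate(ch) and is_high_surrogate(clusters[-1][-1]))
        )
        if attaches:
            clusters[-1] += ch      # extend the current group
        else:
            clusters.append(ch)     # start a new group
    return clusters

def is_uchar(s):
    """True if s is exactly one group under the simplified rules above."""
    return len(split_uchars(s)) == 1
```

With this, is_uchar('e\u0301') and is_uchar('\ud800\udd40') are both
true, while is_uchar('ab') is false. A center() built on such a check
could accept any fill for which is_uchar() holds.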

> Given the apparent difficulty of writing even basic text processing
> algorithms in presence of surrogate pairs, I wonder how wise it is to
> expose Python users to them.  As Wikipedia explains, [1]
>
> """
> Because the most commonly used characters are all in the Basic
> Multilingual Plane, converting between surrogate pairs and the
> original values is often not tested thoroughly. This leads to
> persistent bugs, and potential security holes, even in popular and
> well-reviewed application software.
> """

So we did not test thoroughly enough and need to add appropriate unit 
tests as bugs are fixed.
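
As a sketch of what such a test might look like for the center() case
quoted above (the class and method names are invented here, and this is
not an existing test in the suite): it should pass wherever chr(i) is
always one character long, and it would fail on a build where the fill
character is stored as a surrogate pair.

```python
import unittest

NON_BMP = "\U00010140"   # the non-BMP code point from the example above

class NonBMPFillTest(unittest.TestCase):
    """Hypothetical regression tests for padding with a non-BMP fill."""

    def check(self, padded):
        # 'xyz' padded to width 20 needs exactly 17 fill characters.
        self.assertEqual(padded.count(NON_BMP), 17)
        self.assertEqual(padded.strip(NON_BMP), "xyz")

    def test_center(self):
        self.check("xyz".center(20, NON_BMP))

    def test_ljust(self):
        self.check("xyz".ljust(20, NON_BMP))

    def test_rjust(self):
        self.check("xyz".rjust(20, NON_BMP))
```

Saved to a file, this can be run with python -m unittest.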


-- 
Terry Jan Reedy


