[ python-Bugs-989185 ] unicode.width broken for combining characters

SourceForge.net noreply at sourceforge.net
Mon Jul 12 16:28:49 CEST 2004


Bugs item #989185, was opened at 2004-07-11 20:59
Message generated for change (Comment added) made by donut
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=989185&group_id=5470

Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Mueller (donut)
Assigned to: M.-A. Lemburg (lemburg)
Summary: unicode.width broken for combining characters

Initial Comment:
Python 2.4a1+ (#38, Jul 11 2004, 20:36:10) 
[GCC 3.3.4 (Debian 1:3.3.4-3)] on linux2
Type "help", "copyright", "credits" or "license" for
more information.
>>> u'\u3060'.width()
2
>>> u'\u305f\u3099'.width()
4

Width should be two in both cases.

----------------------------------------------------------------------

>Comment By: Matthew Mueller (donut)
Date: 2004-07-12 07:28

Message:
Logged In: YES 
user_id=65253

TR11 says "Strictly speaking, it makes no sense to talk of
narrow and wide for neutral characters, but since for all
practical purposes they behave like Na, they are treated as
narrow characters (the same as Na) under the recommendations
below."  

In addition, the current implementation gives a width of 1
to non-East-Asian characters.  So fixing the effect of combining
characters on non-East-Asian characters is, IMHO, just as
applicable as fixing it for combining characters on East Asian
text.

And as for display width, I'd say it is useful when writing to
a terminal, but not in its current form. Combining characters
obviously have no width of their own, whether they are "wide"
(which just means they are normally combined with wide
characters) or not.
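
For illustration, here is a minimal sketch of the column-counting
rule described above. It assumes a per-code-point lookup named
unicodedata.east_asian_width(), which is what other comments in
this thread propose (not something the 2.4a1 build quoted above
ships), so treat that name as an assumption:

import unicodedata

def display_width(s):
    # Combining marks take no column of their own; Wide ('W') and
    # Fullwidth ('F') code points take two columns; everything
    # else (Narrow, Halfwidth, Neutral, Ambiguous) takes one.
    columns = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            columns += 2
        else:
            columns += 1
    return columns

# display_width(u'\u3060')       -> 2
# display_width(u'\u305f\u3099') -> 2  (the combining mark adds nothing)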


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 07:00

Message:
Logged In: YES 
user_id=38388

It would help if you would include the Unicode code point
descriptions...

01B5;LATIN CAPITAL LETTER Z WITH STROKE;Lu;0;L;;;;;N;LATIN
CAPITAL LETTER Z BAR;;;01B6;

0327;COMBINING CEDILLA;Mn;202;NSM;;;;;N;NON-SPACING CEDILLA;;;;

0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING
DIAERESIS;Dialytika;;;

I.e., your example does not even include East Asian characters.

If you read TR11, you'll find that:
"""
ED7. Not East Asian (Neutral) - all other characters.
Neutral characters do not occur in legacy East Asian
character sets. By extension, they also do not occur in East
Asian typography.  For example, there is no traditional
Japanese way of typesetting Devanagari.

    Strictly speaking, it makes no sense to talk of narrow
and wide for neutral characters, but since for all practical
purposes they behave like Na, they are treated as narrow
characters (the same as Na) under the recommendations below.
"""

Combining marks such as the ones your example uses cannot be
processed by a simple database lookup. The two marks you
include are marked as A -- Ambiguous.

Furthermore, you should not mistake the East Asian Width for
the display width. It is merely a hint for rendering
engines. See the TR for details.

Hye-Shik, could you give an example of where the EAW is
actually useful in Python programming? I have a feeling that
it is going to cause more confusion than good. It may also be
wise to rename the function to east_asian_width() to signal
that the return value does not have anything to do with
display width, glyphs, etc.
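
As a quick illustration of the Ambiguous classification
mentioned above, using the east_asian_width() name suggested
here (it ships in later releases as
unicodedata.east_asian_width(); an assumption in the context
of this thread):

>>> import unicodedata
>>> unicodedata.east_asian_width(u'\u0327')   # COMBINING CEDILLA
'A'
>>> unicodedata.east_asian_width(u'\u0308')   # COMBINING DIAERESIS
'A'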

----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2004-07-12 06:06

Message:
Logged In: YES 
user_id=65253

I don't think normalization is sufficient.  For example,
consider:

>>> u'\u01b5\u0327\u0308'.width()
3
>>> unicodedata.normalize('NFC',u'\u01b5\u0327\u0308').width()
3

But the width should be one.
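
A short illustration of the failure: there is no precomposed
form for this sequence, so NFC leaves all three code points in
place, and the current implementation counts each of them:

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', u'\u01b5\u0327\u0308'))
3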

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 02:45

Message:
Logged In: YES 
user_id=38388

To be honest: I don't really know how .width() ended up as a
method. The use context seems to be rather limited in that it
only applies to East Asian code points according to Unicode
Standard Annex #11.

I'd suggest moving the whole implementation to unicodedata
instead (and then applying normalization before looking up the
width).

Reading UAX #11 (http://www.unicode.org/reports/tr11/), I also
have a feeling that taking the sum of all widths in a string
of Unicode code points is not a very useful approach. Since
the width is mainly used for rendering East Asian text, only
the per-code-point information is useful. I think it would be
more appropriate to raise an exception if you pass in more
than one code point to the function.
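
A small sketch of what that per-code-point interface looks like
in practice (using the east_asian_width() name floated
elsewhere in this thread; treat it as an assumption here):

>>> import unicodedata
>>> unicodedata.east_asian_width(u'\u3060')
'W'
>>> [unicodedata.east_asian_width(c) for c in u'\u305f\u3099']
['W', 'W']

Both code points in the second string are classified Wide, yet
the combining voiced sound mark contributes no extra column,
which is why summing widths over a whole string is misleading.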


----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-07-11 21:46

Message:
Logged In: YES 
user_id=55188

This sounds like we need to normalize to NFC before evaluating
unicode.width(). So I think we'll need to choose how the
width() method should use the normalization database:

1. Export normalization C API functions from the unicodedata
module (like ucnhash_CAPI) and have unicodeobject use them
when width() is first called.

2. Move unicode.width() to the unicodedata module and use the
normalization functions statically.

I would prefer 2. ;)
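
For reference, NFC does repair the pair from the original
report, which is presumably what motivates the suggestion
(u'\u305f\u3099' composes to the single wide character
u'\u3060'):

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u305f\u3099') == u'\u3060'
True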

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=989185&group_id=5470

