[ python-Bugs-989185 ] unicode.width broken for combining characters

SourceForge.net noreply at sourceforge.net
Thu Jul 15 19:36:35 CEST 2004


Bugs item #989185, was opened at 2004-07-12 05:59
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=989185&group_id=5470

Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Mueller (donut)
Assigned to: M.-A. Lemburg (lemburg)
Summary: unicode.width broken for combining characters

Initial Comment:
Python 2.4a1+ (#38, Jul 11 2004, 20:36:10) 
[GCC 3.3.4 (Debian 1:3.3.4-3)] on linux2
Type "help", "copyright", "credits" or "license" for
more information.
>>> u'\u3060'.width()
2
>>> u'\u305f\u3099'.width()
4

Width should be two in both cases.
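
A rough sketch of the same check, assuming Python 3 syntax and the
per-code-point lookup unicodedata.east_asian_width() rather than the
2.4a1 .width() method used above. It illustrates why a per-code-point
sum gives 4: the combining mark is classified "W" on its own, while
NFC folds the pair into the single letter U+3060.

```python
import unicodedata

composed = "\u3060"          # HIRAGANA LETTER DA
decomposed = "\u305f\u3099"  # HIRAGANA LETTER TA + combining voiced sound mark

# Each code point is classified individually, so the mark is counted too.
classes = [unicodedata.east_asian_width(ch) for ch in decomposed]

# NFC composes the base letter and the mark into the precomposed letter.
nfc = unicodedata.normalize("NFC", decomposed)

print(classes, nfc == composed)
```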

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-15 19:36

Message:
Logged In: YES 
user_id=38388

Martin, you can code such a function in your application
based on the information you'd get from
unicodedata.east_asian_width(). As we've seen, there is no
generally sound way to define such a function.
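
A minimal sketch of such an application-level function, assuming
Python 3 and one particular interpretation of "width" (terminal
columns). The name display_width and the ambiguous parameter are
hypothetical. Combining marks are detected only through a nonzero
canonical combining class, so other zero-width characters are not
handled.

```python
import unicodedata

def display_width(text, ambiguous=1):
    """Estimate terminal columns for text (one possible interpretation)."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            # A combining mark attaches to the previous character.
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw == "A":
            # Ambiguous: the caller decides (1 or 2 columns).
            total += ambiguous
        else:
            # "N", "Na" and "H" are treated as narrow here.
            total += 1
    return total
```

With this policy both strings from the initial report come out as 2
columns, and the Latin example with two combining marks as 1.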

As for the location of the data: the unicodedata module is
the place where any extra information related to Unicode
should go. unicodectype.c is reserved for data needed at C
level by the Python Unicode C API.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-07-15 19:24

Message:
Logged In: YES 
user_id=21627

I still think a function is useful which computes the number
of Ems that a conventional application would expect. That
function should raise an exception for a neutral character -
the example of the combining characters shows that such
characters should *not* be treated as narrow "for all
practical purposes".
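
A sketch of that stricter behaviour, assuming Python 3 and
unicodedata.east_asian_width(). The name strict_width is hypothetical,
and rejecting ambiguous ("A") characters along with neutral ("N") ones
is an assumption of this sketch, not something stated above.

```python
import unicodedata

def strict_width(text):
    """Sum widths, refusing characters that have no defined width."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw in ("Na", "H"):
            total += 1
        else:
            # "N" (neutral) or "A" (ambiguous): no sound answer exists.
            raise ValueError("no defined width for U+%04X (%s)" % (ord(ch), eaw))
    return total
```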

Whether or not it is useful to include the entire UAX#11
classification in the database I don't know - it seems the
only application of the data would be computation of the
width, anyway.

It would not be wise to move the data to the unicode
database, as the extra data currently don't affect Python
programs that don't use the function, anyway - the data does
not consume any additional space.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-15 11:39

Message:
Logged In: YES 
user_id=38388

Hye-Shik, the patch only includes the move to unicodedata, but
not the full implementation of the EAW as per the TR. I would
much prefer to have the east_asian_width() function return
the strings defined in the TR because this allows users of the
function to read the information and implement their own 
interpretation of "width".

The new function should work very much like 
unicodedata.category().
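
To illustrate the parallel, a short sketch assuming Python 3: both
functions take a single character and return a short class string
from the Unicode database. The sample characters are arbitrary.

```python
import unicodedata

samples = ["A", "\uff21", "\uff71", "\u3060", "\u0308"]

# (code point, general category, East Asian Width class)
rows = [
    ("U+%04X" % ord(ch),
     unicodedata.category(ch),
     unicodedata.east_asian_width(ch))
    for ch in samples
]

for row in rows:
    print(*row)
```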

It would also be wise to move the data itself over to the
unicode
database - that way the extra data does not affect Python
programs that don't use the function.

Thanks.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-07-14 17:15

Message:
Logged In: YES 
user_id=55188

Marc-Andre, here's a patch written as you've suggested.
Can you please give a review on this?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 21:41

Message:
Logged In: YES 
user_id=38388

Thanks for your descriptions, Hye-Shik.

Since the application space is very much targeted at East
Asian scripts, I would like the implementation to be moved
into unicodedata where all the other special Unicode
features are implemented. The .width() method should be removed.

Now that I understand better what the EAW is about, I would
also like to see the function be renamed to
east_asian_width() since that's what the function is based on.

If possible, I'd also rather like to see the full width
mapping implemented (as defined in the TR). The reduction to
narrow vs. wide seems to be oversimplified. The
east_asian_width() function should return the characters:
"N", "A", "H", "W", "F", "Na" and let the user decide how to
map these to character or string widths. We have followed
the same methodology for the other Unicode database
properties and this has not only given us much more
flexibility, it also is standards compliant and you can get
good documentation on these features.
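
As a sketch of what "let the user decide" could look like, assuming
Python 3: the application supplies its own table from the six class
strings to column counts. The two policy tables and the columns()
helper are hypothetical names.

```python
import unicodedata

# Policy for a non-CJK context: ambiguous characters are narrow.
NARROW_POLICY = {"N": 1, "Na": 1, "H": 1, "A": 1, "W": 2, "F": 2}

# Policy for a legacy CJK context: ambiguous characters are wide.
CJK_POLICY = dict(NARROW_POLICY, A=2)

def columns(ch, policy):
    """Map one character to a column count under the given policy."""
    return policy[unicodedata.east_asian_width(ch)]

alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA, class "A"
print(columns(alpha, NARROW_POLICY), columns(alpha, CJK_POLICY))
```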

Matthew, I suggest you write your own implementation of what
you think is right. In the face of ambiguity, there's no
such thing as the right approach to a certain problem.

Thanks.

----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2004-07-12 18:03

Message:
Logged In: YES 
user_id=65253

My complaint was that you were attacking my example using
non-asian characters, when the TR specifically says they are
handled as narrow.

I write Asian characters the same as anything else.  If it's
a unicode string python converts it with sys.stdout.encoding
(for print anyway).  Otherwise you just have to write in
whatever encoding the terminal expects.

And when you are talking about a fixed-width text terminal,
wide characters take 2 columns, narrow take 1. Assuming you
ignore combining characters, which is what this is all
about.  You said the width is only a "hint for rendering
engines", but I cannot think of any rendering engine that
would benefit from a hint that can be 2-3x wider (due to
counting combining characters) than when you actually
display it.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-07-12 17:53

Message:
Logged In: YES 
user_id=55188

Major usages that I expected for width() are:

- Hints for terminal-based applications (for cursor position
and layouts)

- To generate fixed-width text documents that are not ugly,
e.g. printing a "------" decoration under each subject.

- A more readable limit for table columns, e.g. topics in web
bulletin boards: limiting by the same number of characters
gives very wide topic lines when they are full of East Asian
characters, and narrow, snipped English topics.

- To locate "^" in the correct position on Python tracebacks.
This isn't implemented in the standard traceback, but width()
lets third parties easily implement a sys.excepthook for East
Asian text.

In fact, I don't know if width() can easily be modified to
support the variety of combining characters from Western
scripts. But if it isn't too heavy or complicated, I
would volunteer to extend the width() implementation to make
it provide a generic fixed-width rendering hint.
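
The "------" decoration use case can be sketched as follows, assuming
Python 3 and a simple column count in which wide and fullwidth
characters take two columns and combining marks none. The helper
names are hypothetical.

```python
import unicodedata

def columns(text):
    """Count columns: W/F are 2, combining marks 0, everything else 1."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def underline(title, rule="-"):
    """Return the title followed by a rule matching its column count."""
    return title + "\n" + rule * columns(title)

print(underline("\u3060\u3044"))
```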

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 17:23

Message:
Logged In: YES 
user_id=38388

I don't understand your complaint: width 1 means "narrow"
just as defined in the TR ?!

How do you write Asian characters to a terminal ?

I think you are mixing glyphs with code points here.

----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2004-07-12 16:28

Message:
Logged In: YES 
user_id=65253

TR11 says "Strictly speaking, it makes no sense to talk of
narrow and wide for neutral characters, but since for all
practical purposes they behave like Na, they are treated as
narrow characters (the same as Na) under the recommendations
below."  

In addition, the current implementation gives a width of 1
to non-East Asian characters.  So talking about fixing the
effect of combining characters on non-East Asian characters
is, IMHO, just as applicable as combining characters on Asian
text.

And for display width, I'd say it is useful when writing to
a terminal.  But not in its current form. Combining
characters obviously have no width, whether they are
"wide" (which just means they are normally combined with wide
characters) or not.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 16:00

Message:
Logged In: YES 
user_id=38388

It would help if you would include the Unicode code point
descriptions...

01B5;LATIN CAPITAL LETTER Z WITH STROKE;Lu;0;L;;;;;N;LATIN
CAPITAL LETTER Z BAR;;;01B6;

0327;COMBINING CEDILLA;Mn;202;NSM;;;;;N;NON-SPACING CEDILLA;;;;

0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING
DIAERESIS;Dialytika;;;

Ie. your example does not even include East Asian characters.

If you read the TR11, you'll find that:
"""
ED7. Not East Asian (Neutral) - all other characters.
Neutral characters do not occur in legacy East Asian
character sets. By extension, they also do not occur in East
Asian typography.  For example, there is no traditional
Japanese way of typesetting Devanagari.

    Strictly speaking, it makes no sense to talk of narrow
and wide for neutral characters, but since for all practical
purposes they behave like Na, they are treated as narrow
characters (the same as Na) under the recommendations below.
"""

Combining marks as the ones that your example uses cannot be
processed by doing a simple database lookup. The two marks
you include are marked as A -- Ambiguous.

Furthermore, you should not mistake the East Asian Width for
the display width. It is merely a hint for rendering
engines. See the TR for details.

Hye-Shik, could you give an example of where the EAW is
actually useful in Python programming ? I have a feeling
that it is going to cause more confusion than do good. It
may also be wise to rename the function to
east_asian_width() to signal that the return value does not
have anything to do with a display width, glyphs, etc.

----------------------------------------------------------------------

Comment By: Matthew Mueller (donut)
Date: 2004-07-12 15:06

Message:
Logged In: YES 
user_id=65253

I don't think normalization is sufficient.  For example,
consider:

>>> u'\u01b5\u0327\u0308'.width()
3
>>> unicodedata.normalize('NFC',u'\u01b5\u0327\u0308').width()
3

But width should be one.
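
The same point in a short Python 3 sketch: no precomposed character
exists for this sequence, so NFC leaves all three code points in
place and any per-code-point sum still counts the two marks.

```python
import unicodedata

s = "\u01b5\u0327\u0308"  # Z WITH STROKE + CEDILLA + DIAERESIS
nfc = unicodedata.normalize("NFC", s)

# Names of the code points that remain as combining marks after NFC.
marks = [unicodedata.name(ch) for ch in nfc if unicodedata.combining(ch)]

print(len(nfc), marks)
```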

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-07-12 11:45

Message:
Logged In: YES 
user_id=38388

To be honest: I don't really know how .width() ended up as
a method. The use context seems to be rather limited in that
it only applies to East Asian code points according to Unicode
Standard Annex #11.

I'd suggest moving the whole implementation to unicodedata
instead (and then applying normalization before looking up
the width).

Reading the UAX11 (http://www.unicode.org/reports/tr11/)
I also have a feeling that taking the sum of all widths in a
string of Unicode code points is not a very useful approach.
Since the width is mainly used for rendering East Asian
text, only the per code point information is useful.
I think that it would be more appropriate to raise an
exception if you pass in more than one code point to the
function.
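
A sketch of that per-code-point approach, assuming Python 3, where
unicodedata.east_asian_width() accepts exactly one character and, as
far as current CPython goes, raises TypeError otherwise. The helper
name per_code_point is hypothetical.

```python
import unicodedata

def per_code_point(text):
    """Return (code point, East Asian Width class) for each character."""
    return [("U+%04X" % ord(ch), unicodedata.east_asian_width(ch))
            for ch in text]

pairs = per_code_point("\u305f\u3099")

# Passing the whole string at once is rejected rather than summed.
try:
    unicodedata.east_asian_width("\u305f\u3099")
    rejected = False
except TypeError:
    rejected = True

print(pairs, rejected)
```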


----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-07-12 06:46

Message:
Logged In: YES 
user_id=55188

It sounds like we need to normalize to NFC before evaluating
unicode.width().
So I think we'll need to choose how to use the normalization
database from the width() method:

1. export the normalization C API functions from the unicodedata
module, like ucnhash_CAPI, and have unicodeobject use them when
width() is first called.

2. move unicode.width() to the unicodedata module and use the
normalization functions statically.

I would prefer 2. ;)

----------------------------------------------------------------------
