[XML-SIG] Re: Issues with Unicode type
Mike Brown
mike@skew.org
Mon, 23 Sep 2002 15:38:52 -0600 (MDT)
Tom Emerson wrote:
> internally Python is representing characters outside the BMP as a
> surrogate pair in UTF-16, the length of a Unicode string using these
> characters is 2 --- two UTF-16 characters.
To be pedantic, characters are on a different level of abstraction than
surrogate pairs, which are pairs of 16-bit code values.
code value != character
rather,
code value sequence (1 or more) may be equivalent to a character
In UTF-16, many characters can be represented with a single code value, but
some require two code values, both selected from a range of values that are
not individually assigned to characters.
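To make the two-code-value case concrete, here is a minimal sketch of the arithmetic that maps one character outside the BMP to its pair of 16-bit code values. The function name to_surrogate_pair is made up for illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low

print([hex(v) for v in to_surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']
```

Neither 0xD800 nor 0xDC00 is a character on its own; only the pair, taken together, stands for U+10000.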
Programming languages still take a shortcut here: they define a 'character'
data type as whatever kind of code value is adequate 99% of the time, which
often means you're stuck with no distinction between the idea of a character
and the single 16-bit code value that usually represents it internally.
Consequently you find that len(someString) gives you not the number of
characters but the number of code values in the string. And 99% of the time,
that's fine ... until your string contains one of the other (roughly 1.1
million minus 65536) code points that lie outside the BMP.
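If you do need the other count, one way to get it is to walk the code values and treat each well-formed surrogate pair as a single unit. This is only a sketch: count_code_points is a hypothetical helper, it takes a plain list of integers rather than a Python string so that it behaves the same on any build, and it counts an unpaired surrogate as one unit rather than rejecting it.

```python
def count_code_points(code_values):
    """Count code points in a sequence of 16-bit UTF-16 code values (ints)."""
    count = 0
    i = 0
    n = len(code_values)
    while i < n:
        is_high = 0xD800 <= code_values[i] <= 0xDBFF
        has_low = i + 1 < n and 0xDC00 <= code_values[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

values = [0x0041, 0xD800, 0xDC00]   # 'A' followed by U+10000
print(len(values))                  # 3 code values
print(count_code_points(values))    # 2 code points
```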
So I think the problem here is not that Python says len(u"\uD800\uDC00") is 2
(unless somewhere it says that Python supports Unicode 3.2) but that someone
assumed len() returns a count of Unicode characters...
> If you compile your Python installation to use "wide" Unicode
> characters (i.e., UTF-32), then I expect the behavior to be
>
> >>> c = u"\U00010000"
> >>> len(c)
> 1
Agreed.
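For anyone who wants to check which kind of build they have, sys.maxunicode is the place to look: as far as I know it reports 0xFFFF on a "narrow" build and 0x10FFFF on a "wide" one, and from Python 3.3 onward it is always 0x10FFFF. A minimal sketch follows; is_wide_build is just a name chosen for this example.

```python
import sys

def is_wide_build():
    """Return True if one string element can hold any Unicode code point."""
    return sys.maxunicode > 0xFFFF

# On a wide build, len() of a single non-BMP character is 1;
# on a narrow build the same character occupies two code values.
print(hex(sys.maxunicode), is_wide_build())
```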
> >>> len(c)
> u'\U00010000'
I think you mean c, not len(c).
- Mike
____________________________________________________________________________
mike j. brown | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/