[XML-SIG] Re: Issues with Unicode type

25 Sep 2002 17:15:52 +0200

* Uche Ogbuji
|     
| import re
| 
| # Matches one UTF-16 surrogate pair: a high surrogate followed by a low one
| SP_PAT = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")
| def smart_len(u):
|     sp_count = len(SP_PAT.findall(u))
|     return len(u) - sp_count
| 
| 
| Problem solved.

In a sense. You now have len(u""), which counts code units (thus
giving different results in UTF-16 and UTF-32 builds), and
smart_len(u""), which counts characters and thus always gives the
same result.
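
To make the distinction concrete, here is a minimal sketch. It assumes
Python 3.3 or later, where a str is a sequence of code points no matter
how the interpreter was built, so it does not reproduce the build
dependence described above. The helper names utf16_units and
code_points are made up for illustration.

```python
def utf16_units(s):
    """Number of UTF-16 code units needed to represent s."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    return len(s.encode("utf-16-le")) // 2


def code_points(s):
    """Number of Unicode code points in s (what len() counts in Python 3.3+)."""
    return len(s)


# One BMP character followed by one supplementary-plane character (U+10400).
sample = "a\U00010400"

print(utf16_units(sample))   # 3: U+10400 needs a surrogate pair
print(code_points(sample))   # 2
```

The first number is what a UTF-16 based length would report, the second
is what smart_len is trying to recover.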

Java has the same problem, in that String.length() counts code units.
On the other hand, a Java String is defined to always contain UTF-16
code units, so at least the answer does not vary between builds.

Note that this problem is also inherent in the XML family of
specifications. The DOM 1.0 definition of string was broken, while the
2.0 one equates strings with arrays of UTF-16 code units. In XPath, on
the other hand, strings consist of abstract Unicode characters...

Note also that there is one further problem. How long is this string

  u"\u0041\u030A"

according to RELAX/XPath/XSDL?
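
To illustrate why the question is awkward, here is a small sketch,
again assuming Python 3.3 or later. It only shows what counting code
points gives before and after NFC normalization; it does not claim to
say what any of the three specifications prescribes.

```python
import unicodedata

# LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE
decomposed = "\u0041\u030A"

# NFC composes the pair into the single precomposed code point U+00C5
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 2 code points
print(len(composed))          # 1 code point
print(composed == "\u00C5")   # True
```

A reader sees one letter in both cases, yet the code point count
depends on which normalization form the string happens to be in.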

* Daniel Veillard
| 
| I don't think chars are classes but types, and hence one cannot make
| a subclass of strings whose instances could have all
| length/walk/extract operations being special cased to reflect XML
| unicode string. I (and Eric I bet) would like to be wrong on this
| :-)

This has nothing to do with XML, it's just that XML is one of the few
technologies that are sufficiently modern to make this problem show
up.  If you want to have proper Unicode support in any application you
will run into this problem.

The problem here is that the UTF-16 == Unicode assumption is built
into all sorts of technologies, from Python to Java to Ada-95 to Win32
to DOM 2.0 to ..., and in most cases people are not even aware of the
problem. 

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >