[XML-SIG] Re: Issues with Unicode type

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 14:29:07 -0600


> On Mon, 2002-09-23 at 21:27, Uche Ogbuji wrote:
> >
> > Eric found this message where Guido does a decent job of summarizing the
> > various issues, though I'm not sure I agree with his conclusion:
> >
> > http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html
> >
> > I should note that based on code Eric found in James Clark's code, Java
> > doesn't treat surrogates specially internally, either, which I guess
> > tends to bolster Guido's POV  :-(
>
> Yes... however, there seems to be *some* notion of surrogates at least
> in the unicode.__repr__() method:
>
> >>> print "%r" % c
> u'\u10800'

Yeah.  It seems that the idea has been to make the representation machinery
smart enough to handle surrogate pairs.  From
Python-2.2.1/Objects/unicodeobject.c, line 1798 (basically the repr
implementation):

/* Map UTF-16 surrogate pairs to Unicode \UXXXXXXXX escapes */

This keeps the round trip through repr() and eval() lossless:

>>> c = u"\uD800\uDC00"
>>> len(c)
2
>>> repr(c)
"u'\\U00010000'"
>>> r = repr(c)
>>> roundtrip_c = eval(r)
>>> roundtrip_c
u'\U00010000'
>>> len(roundtrip_c)
2
>>> roundtrip_c == c
1


And yet len() and friends are not smart enough to recognize the pair as a
single character.  I assume re would have the same problem with ".".
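
For what it's worth, the arithmetic that a surrogate-aware len() would need is small.  The following is only a sketch, not anything taken from the Python sources: it is written in Python 3 syntax, it works on plain lists of UTF-16 code units rather than on unicode objects, and the helper names combine_surrogates and count_code_points are made up for illustration.

```python
# Sketch only: hypothetical helpers, operating on lists of UTF-16 code units.

def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def count_code_points(units):
    """Count characters in a sequence of UTF-16 code units.

    A well-formed surrogate pair counts once; an unpaired surrogate
    is counted as one unit on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = [0xD800, 0xDC00]                 # the same pair as c above
print(len(units))                        # 2, the code-unit count
print(count_code_points(units))          # 1
print(hex(combine_surrogates(*units)))   # 0x10000
```

The pair-aware count gives 1 where the code-unit count gives 2, which is exactly the gap between repr() and len() shown in the session above.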

This just deepens my unease at Guido's reluctance to support surrogates in the
code that handles UTF-16 in Python.  The inconsistency seems ugly.

But as Tom says, it looks like this matter has been beaten to death, and it's
pretty much settled.  Now I see why Red Hat plumped for compiling Python with
UTF-32 support (and wchar_t).  I think it's the only route to sanity.

Having said all this, Martin is right about XML and the BMP.  I'd forgotten.

Here you go, right out of the XML 1.0 spec:

"""
4.1 Character and Entity References

[Definition:] A character reference refers to a specific character in the
ISO/IEC 10646 character set, for example one not directly accessible from
available input devices.

Character Reference
[66]  CharRef ::= '&#' [0-9]+ ';'
                | '&#x' [0-9a-fA-F]+ ';'    [ WFC: Legal Character ]

Well-Formedness Constraint: Legal Character
Characters referred to using character references must match the production
for Char.
"""

and so...

"""
2.2 Characters

[Definition:] A parsed entity contains text, a sequence of characters, which
may represent markup or character data. [Definition:] A character is an atomic
unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters
are tab, carriage return, line feed, and the legal graphic characters of
Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined
in section 6.8 of [Unicode], is discouraged.

Character Range
[2]  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
            | [#x10000-#x10FFFF]
     /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

"""
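
The two productions quoted above can be turned into a quick check.  This is a rough sketch in Python 3 syntax, not code from any parser: the names CHAR_REF, CHAR_RANGES, is_xml_char and char_ref_is_legal are hypothetical, and the ranges are copied straight from production [2].

```python
import re

# Production [66]: CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")

# Production [2]: Char, as inclusive (low, high) code point ranges.
CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point):
    """True if the code point matches the XML 1.0 Char production."""
    return any(low <= code_point <= high for low, high in CHAR_RANGES)


def char_ref_is_legal(ref):
    """Apply the 'Legal Character' constraint to one character reference."""
    match = CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a character reference: %r" % ref)
    decimal, hexadecimal = match.groups()
    if decimal is not None:
        code_point = int(decimal)
    else:
        code_point = int(hexadecimal, 16)
    return is_xml_char(code_point)


print(char_ref_is_legal("&#xD800;"))    # False: surrogate block
print(char_ref_is_legal("&#x10000;"))   # True: inside [#x10000-#x10FFFF]
print(char_ref_is_legal("&#xFFFE;"))    # False: excluded explicitly
```

A reference to a surrogate code point such as &#xD800; fails the constraint, while a reference to a code point in the last range passes.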

So 𐠀 is not WF XML.  I'm not sure why JJC uses it.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html