[XML-SIG] Re: Issues with Unicode type
Uche Ogbuji
uche.ogbuji@fourthought.com
Mon, 23 Sep 2002 14:29:07 -0600
> On Mon, 2002-09-23 at 21:27, Uche Ogbuji wrote:
> >
> > Eric found this message where Guido does a decent job of summarizing the
> > various issues, though I'm not sure I agree with his conclusion:
> >
> > http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html
> >
> > I should note that based on code Eric found in James Clark's code, Java
> > doesn't treat surrogates specially internally, either, which I guess
> > tends to bolster Guido's POV :-(
>
> Yes... however, there seems to be *some* notion of surrogates at least
> in the unicode.__repr__() method:
>
> >>> print "%r" % c
> u'\u10800'
Yeah. It seems that the idea has been to make the representation machinery
smart enough to handle surrogate pairs: from
Python-2.2.1/Objects/unicodeobject.c line 1798 (basically the repr
implementation):
/* Map UTF-16 surrogate pairs to Unicode \UXXXXXXXX escapes */
This is kept idempotent for round trip:
>>> c = u"\uD800\uDC00"
>>> len(c)
2
>>> repr(c)
"u'\\U00010000'"
>>> r = repr(c)
>>> roundtrip_c = eval(r)
>>> roundtrip_c
u'\U00010000'
>>> len(roundtrip_c)
2
>>> roundtrip_c == c
1
And yet len and friends are not smart enough to recognize it. I assume re
would have the same problem with ".".
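
To make the gap concrete, here is a rough sketch of what a surrogate-aware length would have to do. It works on a plain list of UTF-16 code units rather than on a unicode object, so it behaves the same on any build. The helper names decode_surrogate_pair and codepoint_len are made up for illustration and are not anything in Python itself.

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_len(units):
    """Count code points in a sequence of UTF-16 code units (ints).

    A high surrogate followed by a low surrogate counts once;
    anything else, including a lone surrogate, counts as one unit.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = [0xD800, 0xDC00]                   # code units of u"\uD800\uDC00"
print(len(units))                          # 2, the code-unit count
print(codepoint_len(units))                # 1
print(hex(decode_surrogate_pair(*units)))  # 0x10000
```

The same pairing logic is what "." in re would need in order to match one character instead of half of one.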
This just deepens my unease at Guido's reluctance to support surrogates in
the code that handles UTF-16 in Python. The inconsistency seems ugly.
But as Tom says, it looks like this matter has been beaten to death, and
it's pretty much settled. Now I see why Red Hat plumped for compiling Python
with UTF-32 support (and wchar_t). I think it's the only route to sanity.
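
For anyone unsure which kind of build they have, sys.maxunicode tells you. A minimal check, assuming nothing beyond the standard sys module (the variable name is_wide is mine):

```python
import sys

# 0xFFFF on a narrow (UTF-16) build, 0x10FFFF on a wide (UTF-32) build.
# Python 3.3 and later always report 0x10FFFF.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide else "narrow")
```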
Having said all this, Martin is right about XML and the BMP. I'd forgotten.
Here you go, right out of the XML 1.0 spec:
"""
4.1 Character and Entity References
[Definition:] A character reference refers to a specific character in the
ISO/IEC 10646 character set, for example one not directly accessible from
available input devices.
Character Reference
[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';' [ WFC: Legal Character ]
Well-Formedness Constraint: Legal Character
Characters referred to using character references must match the production
for Char.
"""
and so...
"""
2.2 Characters
[Definition:] A parsed entity contains text, a sequence of characters, which
may represent markup or character data. [Definition:] A character is an atomic
unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters
are tab, carriage return, line feed, and the legal graphic characters of
Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined
in section 6.8 of [Unicode], is discouraged.
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks,
FFFE, and FFFF. */
"""
So 𐠀 is not WF XML. I'm not sure why JJC uses it.
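
The two productions quoted above can be sketched as a small check. This is only an illustration: the names is_xml_char, CHAR_REF and char_ref_is_legal are hypothetical, and the ranges are copied straight from production [2]. Note that it rejects the surrogate blocks, FFFE and FFFF, while accepting the range #x10000-#x10FFFF.

```python
import re


def is_xml_char(cp):
    """True if code point cp matches XML 1.0 production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Production [66] CharRef: decimal or hexadecimal form.
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def char_ref_is_legal(ref):
    """Apply the Legal Character constraint to one character reference."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        return False
    cp = int(m.group(1), 16) if m.group(1) is not None else int(m.group(2))
    return is_xml_char(cp)


print(char_ref_is_legal("&#xD800;"))  # False, high surrogate
print(char_ref_is_legal("&#xDC00;"))  # False, low surrogate
print(char_ref_is_legal("&#x41;"))    # True
```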
-- 
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Apache 2.0 API - http://www-106.ibm.com/developerworks/linux/library/l-apache/
Python&XML column: Tour of Python/XML - http://www.xml.com/pub/a/2002/09/18/py.html
Python/Web Services column: xmlrpclib - http://www-106.ibm.com/developerworks/webservices/library/ws-pyth10.html