[XML-SIG] Re: Issues with Unicode type

Eric van der Vlist vdv@dyomedea.com
25 Sep 2002 12:13:47 +0200


On Wed, 2002-09-25 at 01:52, Uche Ogbuji wrote:
> >
> > Martin v. Loewis writes:
> >  > 3. Implement it properly. Please understand that you will be trading
> >  >    efficiency for correctness.
> >
> > I'm sure a small C extension could provide the needed helpers quite
> > efficiently.  Even with a UCS-4 version of Python, a Unicode literal
> > containing a surrogate pair (explicitly, using two \u sequences) will
> > exhibit the behavior that Eric wants to see suppressed.
>
> Yes.  That was what I figured to in my recent rumination on such literals.  My
> conclusion was *never* to use "naked" surrogate pairs in Unicode literals,
> even with UTF-16 Python.  I get the sense this is a "best practice" that
> should be clearly articulated:
>
> Do *not* express Unicode literals using direct UTF-16 surrogate pairs, e.g.
> u"\uD800\uDC00".  *Always* use the high-order unicode literal character form
> (big-U notation), e.g. u"\U00010000".
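
To make the difference between the two notations concrete, here is a
minimal sketch. It assumes a current Python 3 interpreter, where strings
always hold full code points (comparable to a ucs4 build), so the u
prefix is dropped; the variable names are only illustrative.

```python
# Two \u escapes stay two separate surrogate code points on a build
# whose strings hold full code points; they are not merged into one.
pair = "\ud800\udc00"    # "naked" surrogate pair
single = "\U00010000"    # big-U notation for the same intended character

print(len(pair), len(single))   # lengths of the two literals
print(pair == single)           # whether the two spellings are equal
```

On such an interpreter the two literals are different strings, which is
why the big-U form is the safer spelling.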

I am not 100% sure this is the same issue, but the script [1] that
defines the XML productions, generated by chargen [2] and used in my
implementation, doesn't seem to work correctly on a Python interpreter
compiled with ucs4.

[1] http://downloads.xmlschemata.org/python/xvif/characters.py
[2]
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pyxml/xml/utils/xmlchargen.py

What makes me say this is that, on a Python interpreter compiled with
ucs4, my Relax NG implementation no longer catches invalid XML names
such as u'\u0E35', whereas it works fine with the same Python version
compiled for ucs2.

This can be checked quite easily:

1) with a ucs2 interpreter:

vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ python
Python 2.2.1 (#1, Sep 13 2002, 22:38:05)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import characters
>>> print characters.re_NCName().match(u'\u0E35')
None

2) with a ucs4 interpreter:

vdv@ibook:~/xmlschemata-cvs/downloads/python/xvif$ python
Python 2.2.1 (#5, Sep 25 2002, 11:18:57)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import characters
>>> print characters.re_NCName().match(u'\u0E35')
<_sre.SRE_Match object at 0x10068670>
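
When comparing two interpreters like this, it can help to confirm which
kind of build each one is. A small sketch using sys.maxunicode follows;
the "ucs2"/"ucs4" labels and the variable name are just illustrative,
and on a current Python 3 the value is always the wide one.

```python
import sys

# sys.maxunicode is the largest code point a single string element can
# hold: 0xFFFF on a narrow (ucs2) build, 0x10FFFF on a wide (ucs4) one.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "ucs4" if is_wide else "ucs2")
```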

Does that mean that chargen.py should be rewritten for ucs4? Could a
single version that avoids surrogates handle both builds?
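
As a rough idea of what a surrogate-free approach might look like, the
sketch below builds a character class from code point ranges written in
big-U notation. It is not how xmlchargen.py works: the range table is a
small hand-picked subset of the XML 1.0 name start characters, the names
START_RANGES, char_class and re_start are hypothetical, and it assumes a
current Python 3 (whose re module accepts \U escapes in patterns). It
says nothing about how ranges above U+FFFF would have to be handled on a
ucs2 build.

```python
import re

# Illustrative subset of NCName start characters (XML 1.0 Letter | '_').
# This is NOT the complete production.
START_RANGES = [
    (0x0041, 0x005A),  # A-Z
    (0x005F, 0x005F),  # underscore
    (0x0061, 0x007A),  # a-z
    (0x0E01, 0x0E2E),  # Thai consonants (part of BaseChar)
]

def char_class(ranges):
    """Build a regex character class from (low, high) code point pairs."""
    parts = []
    for low, high in ranges:
        if low == high:
            parts.append("\\U%08X" % low)
        else:
            parts.append("\\U%08X-\\U%08X" % (low, high))
    return "[" + "".join(parts) + "]"

re_start = re.compile(char_class(START_RANGES))

# U+0E01 is a Thai letter; U+0E35 is a combining character and so
# should not be accepted as the first character of a name.
print(re_start.match("\u0E01") is not None)
print(re_start.match("\u0E35") is not None)
```

Because every range is spelled as a full code point, nothing in the
pattern depends on how surrogate pairs are interpreted.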

Thanks

Eric

PS: if someone could help me with chargen.py, which looks like black
magic to me, I would really appreciate it!
-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------