[XML-SIG] Re: Issues with Unicode type

Eric van der Vlist vdv@dyomedea.com
23 Sep 2002 19:21:41 +0200


On Mon, 2002-09-23 at 19:12, Martin v. Loewis wrote:
> Eric van der Vlist <vdv@dyomedea.com> writes:
>
> > > By default Python is using UTF-16 as its Unicode encoding. The
> > > code-point that you specify, U+10800, is outside the BMP and hence is
> > > represented by two surrogate characters in UTF-16.
> >
> > Arg! Does that mean that by default Python isn't strictly conformant to
> > XML 1.0?
>
> No. Why do you think this? Strictly speaking, XML 1.0 defines a
> "character" as defined by ISO/IEC 10646:1993 and ISO/IEC 10646-1:2000.
> This means only characters in the Basic Multilingual Plane are allowed
> in XML. James Clark's document is, strictly speaking, ill-formed.

That's weird...

> That aside, Python does process your document, and represents the
> character U+10800 as defined in the Python language definition. So if
> you extend XML 1.0 to Unicode 3.2 in a canonical way, Python supports
> that character as specified. Any applications that want to count
> Unicode code points might need to take into account surrogates, and
> possibly might not use the len() builtin.

Yep, and that's what James Clark is doing in his Java implementation:

  public int getLength(Object obj) {
    String str = (String)obj;
    int len = str.length();
    int nSurrogatePairs = 0;
    for (int i = 0; i < len; i++)
      if (Utf16.isSurrogate1(str.charAt(i)))
        nSurrogatePairs++;
    return len - nSurrogatePairs;
  }

And I need to do the same in Python...
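
A minimal sketch of what that could look like in Python. The function name
code_point_length is hypothetical, and I am assuming that Utf16.isSurrogate1
tests for the leading (high) surrogate range U+D800..U+DBFF. Like the Java
version, it assumes surrogates are correctly paired:

```python
def code_point_length(s):
    """Count code points in s, treating each UTF-16 surrogate pair as one.

    Hypothetical helper: assumes every high surrogate is followed by a
    low surrogate, as the Java version above also does.
    """
    # Each high (leading) surrogate, U+D800..U+DBFF, starts one pair.
    n_surrogate_pairs = sum(1 for ch in s if 0xD800 <= ord(ch) <= 0xDBFF)
    return len(s) - n_surrogate_pairs


# U+10800 is the surrogate pair D802 DC00 in UTF-16.
print(code_point_length("a\ud802\udc00b"))
```

On a build where strings already hold whole code points, no surrogates are
found and the result is simply len(s).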

> Notice also that U+10800 is unassigned even in Unicode 3.2.

I wonder why he has picked this value!

Thanks

Eric
-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------