[XML-SIG] Re: Issues with Unicode type

Eric van der Vlist vdv@dyomedea.com
26 Sep 2002 14:32:49 +0200


On Thu, 2002-09-26 at 14:17, Martin v. Loewis wrote:
> Eric van der Vlist <vdv@dyomedea.com> writes:
>
> > OTOH, working on implementations of standards (or recs) without aiming
> > for complete conformance is something which I consider dangerous, and
> > I am reaching a point where Python doesn't look like an adequate platform
> > to implement W3C XML Schema datatypes (and hardly an adequate platform
> > to implement Relax NG) because of the lack of support for non-BMP code
> > points.
>
> Please understand that Python is free software. So if it does not fit
> your needs, you can:
> a) adjust your needs, or
> b) adjust Python, or
> c) not use Python.
>
> It is only for non-free software where b) is no option.

Sure, sorry if I have given the impression I was complaining while I am
just trying to evaluate the situation!
>
> > The two issues which I am currently aware of are the length of the
> > strings, which can be solved by implementing an application-level length
> > algorithm, and, more serious, the support of the regular expressions
> > required for the "pattern" facet, for which I don't see how we could rely
> > on the Python regexp features, which are buggy when compiled as ucs4 and
> > will not produce the expected result when compiled as ucs2.
> >
> > Unless we rely on external C extensions such as the ones developed by
> > Daniel for libxml, I just see no way to be "natively conform"!
>
> I think this is a simplification: You can certainly implement the len
> algorithm without regular expressions at all:
>
> import sys
>
> if sys.maxunicode == 65535:
>   def smart_len(s):
>     l = 0
>     for c in s:
>       if not 0xd800 <= ord(c) < 0xdc00:
>         # skip high surrogates - only count the low surrogates
>         l += 1
>     return l
> else:
>   smart_len = len
>
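
A minimal sketch of the same counting rule on a current Python 3
interpreter, where strings are already indexed by code point and
sys.maxunicode is always 0x10FFFF. To reproduce the narrow-build situation
it works on a list of UTF-16 code units instead of a string; the helper
utf16_code_units is an assumption added for illustration and is not part
of the code quoted above.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def smart_len(units):
    """Count code points in a sequence of UTF-16 code units."""
    count = 0
    for unit in units:
        # Skip high surrogates (0xD800-0xDBFF); the low surrogate that
        # follows stands for the whole pair, so each pair counts once.
        if not 0xD800 <= unit < 0xDC00:
            count += 1
    return count


sample = "a\U00010000b"            # 'a', one non-BMP code point, 'b'
units = utf16_code_units(sample)
print(len(units), smart_len(units))
```

The sample holds three code points but four code units, which is the gap
between len and smart_len on a ucs2 build.
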
> The same applies for NCName: You do not *have* to use regular
> expressions. Instead, build a dictionary
>
> NCName = {}
> for char in all_ncname_chars:
>   NCName[char] = 1
>
> With that, you can test whether a character is allowed with
> NCName.has_key(char).
>
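
The table-lookup idea can be sketched as follows for current Python, using
frozensets where the quoted code uses a dictionary. The character tables
here are a deliberately tiny ASCII-only subset chosen for illustration; the
real NCName production allows many more characters, and the names
NAME_START_CHARS, NAME_CHARS and is_ncname are assumptions, not existing
API.

```python
import string

# Illustrative subset only: the real NCName production allows many
# non-ASCII characters that are not listed here.
NAME_START_CHARS = frozenset(string.ascii_letters + "_")
NAME_CHARS = NAME_START_CHARS | frozenset(string.digits + ".-")


def is_ncname(s):
    """Check s against the ASCII-only tables above, without regexps."""
    # The first character has a stricter table than the rest, and the
    # colon is never allowed in an NCName.
    if not s or s[0] not in NAME_START_CHARS:
        return False
    return all(ch in NAME_CHARS for ch in s[1:])


print(is_ncname("foo-bar.1"), is_ncname("1abc"), is_ncname("a:b"))
```
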
> > Again, we can say that it won't matter for "real life applications" and
> > that we don't care about conformance but that's a dangerous path.
>
> My code shows that there is a fourth option, in addition to fixing
> Python:
>
> d) work around the bugs and limitations
>
> Python is Turing-complete, so there is no algorithmic problem that
> cannot be solved in Python. So, saying that you cannot "natively
> conform" is an oversimplification.

Yes, but when it comes to implementing the W3C XML Schema "pattern" facet,
which is basically regular expressions embedded in schemas, this seems
to require rewriting a full regular expression engine. What I meant by
"not natively conform" is that it *seems* not feasible with the built-in
re module in its current state.
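
To make the requirement concrete, here is a minimal sketch of what a
"pattern" check has to do, written for a current Python 3 interpreter
(where strings are sequences of code points, so "." consumes a whole
non-BMP character). It only handles the syntax that XSD regular expressions
and Python's re happen to share; XSD-specific constructs such as \i, \c,
\p{...} and character class subtraction are not translated. The function
name matches_pattern is an assumption made for illustration.

```python
import re


def matches_pattern(value, pattern):
    """Test value against pattern with XSD-style implicit anchoring."""
    # An XSD pattern must match the entire lexical value, so fullmatch
    # is used rather than search or match.
    return re.fullmatch(pattern, value) is not None


non_bmp = "\U00010000"

# One "." must consume one code point, even outside the BMP.
print(matches_pattern(non_bmp, "."))
# A character class ranging over the supplementary planes.
print(matches_pattern("\U00010400", "[\U00010000-\U0010FFFF]"))
# Implicit anchoring: a partial match is not a match.
print(matches_pattern("ab", "a"))
```

On a ucs2 build the first two checks are the ones that go wrong, because
the engine sees two surrogate code units where the schema means one
character.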

Eric (just trying to see what he is stepping into)
>
> Regards,
> Martin
>
>
-- 
Rendez-vous à Paris.
                          http://www.technoforum.fr/integ2002/index.html
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------