[XML-SIG] Re: Parsing a unicode string

Uche Ogbuji uche.ogbuji at fourthought.com
Sat Oct 9 22:34:35 CEST 2004

On Wed, 2004-10-06 at 01:37, konrad.hinsen at laposte.net wrote:
> On 05.10.2004, at 21:10, Mike Brown wrote:
> > To clarify for Konrad's benefit -
> >
> > XML syntax is defined in terms of ISO/IEC 10646 characters.
> > XML parsing is defined in terms of encoded byte streams.
> Interesting. I had always thought of XML as a (unicode) text 
> representation of structured data, and of the encoding as a means to 
> make it compatible with the currently dominating world of byte streams.
> What does one gain by marrying XML to byte streams? If some day in the 
> future 32-bit units become the smallest useful ones in computing, this 
> will just cause compatibility headaches.

Unicode is an abstraction: a repertoire of characters, not a concrete
representation.  It doesn't really make sense to define an XML *parser*
as operating on Unicode.  Python, for instance, uses a special data
structure to represent Unicode.  Surely you don't expect the XML spec to
define parsing as some transformation on that data structure?

It really only makes sense to describe XML parsing in terms of byte
streams.  Now the character model, which is the *result* of parsing,
*is* defined in terms of abstract Unicode.
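
To make the distinction concrete, here is a minimal sketch, assuming a
current Python 3 and the standard library's xml.etree.ElementTree (not the
tools that were current when this thread was written).  The helper names
encode_document and parse_text are made up for illustration.  The same
document is serialized in three encodings, giving three different byte
streams, and each one parses to the same abstract character data.

```python
import xml.etree.ElementTree as ET

# Character content of the document, as abstract Unicode characters.
TEXT = "caf\u00e9"

def encode_document(encoding):
    """Serialize the same document to bytes in the given encoding.

    Hypothetical helper for this sketch; the encoding declaration is
    written to match the bytes actually produced.
    """
    doc = '<?xml version="1.0" encoding="%s"?><doc>%s</doc>' % (encoding, TEXT)
    return doc.encode(encoding)

def parse_text(data):
    """Parse a byte stream and return the root element's character data."""
    return ET.fromstring(data).text

ENCODINGS = ("utf-8", "iso-8859-1", "utf-16")

# Input to the parser: three distinct byte streams.
byte_streams = {enc: encode_document(enc) for enc in ENCODINGS}

# Output of the parser: the same Unicode string every time.
parsed = {enc: parse_text(data) for enc, data in byte_streams.items()}

for enc in ENCODINGS:
    print(enc, len(byte_streams[enc]), "bytes ->", repr(parsed[enc]))
```

The parser reads the byte order mark and the encoding declaration from the
bytes themselves, which is why its input has to be described as a byte
stream, while what it hands back is encoding-independent.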

It's a bit twisty, but the way XML sorts this out makes perfect sense.

Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com