[XML-SIG] Re: Parsing a unicode string

Thomas B. Passin tpassin at comcast.net
Wed Oct 6 21:19:34 CEST 2004

konrad.hinsen at laposte.net wrote:
> On 05.10.2004, at 21:10, Mike Brown wrote:
>> To clarify for Konrad's benefit -
>> XML syntax is defined in terms of ISO/IEC 10646 characters.
>> XML parsing is defined in terms of encoded byte streams.
> Interesting. I had always thought of XML as a (unicode) text 
> representation of structured data, and of the encoding as a means to 
> make it compatible with the currently dominating world of byte streams.
> What does one gain by marrying XML to byte streams? If some day in the 
> future 32-bit units become the smallest useful ones in computing, this 
> will just cause compatibility headaches.

Well, it isn't really married to byte streams, exactly.  The XML Rec says:

"[Definition: A parsed entity contains text, a sequence of characters, 
which may represent markup or character data.] [Definition: A character 
is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 
10646].] Legal characters are tab, carriage return, line feed, and the 
legal characters of Unicode and ISO/IEC 10646."
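
The Rec pins down "legal characters" as ranges of code points in its Char 
production (XML 1.0, section 2.2).  A minimal sketch of that check in 
Python; the helper name is_xml_char is made up for illustration and does 
not come from any library:

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)  # the code point, independent of any byte encoding
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )
```

Note that the test works on code points only; nothing in it refers to 
bytes or to an encoding.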

So the key things are the *sequence of characters*, and that a character 
is an ISO/IEC 10646 atomic unit.  It may be that as a practical matter 
of network implementation, the sequence of characters is handled as a 
stream of bytes, but the XML Rec does not say any such thing.  Of 
course, an XML processor has to be able to handle the UTF-8 and UTF-16 
encodings, so in that sense it does have to know about byte streams.
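
To make the distinction concrete: the same character sequence can be 
serialized as quite different byte streams, and a parser should hand back 
the same characters either way.  A small sketch using Python 3's 
xml.etree.ElementTree; the sample document and the helper name parse_text 
are made up for illustration:

```python
import xml.etree.ElementTree as ET

# One character sequence; only the declared encoding name varies.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><p>caf\u00e9</p>'


def parse_text(data: bytes) -> str:
    """Parse an encoded XML document and return the root element's text."""
    return ET.fromstring(data).text


utf8_bytes = TEMPLATE.format(enc="utf-8").encode("utf-8")
# Python's "utf-16" codec prepends a byte order mark to the output.
utf16_bytes = TEMPLATE.format(enc="utf-16").encode("utf-16")

from_utf8 = parse_text(utf8_bytes)
from_utf16 = parse_text(utf16_bytes)

print(utf8_bytes == utf16_bytes)  # compares the two byte streams
print(from_utf8 == from_utf16)    # compares the decoded character sequences
```

The two byte strings differ in length and content, yet the parsed text 
compares equal, which is the sense in which the encoding is a transport 
detail rather than part of the document's content.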

If you generalize from byte streams to character sequences, then yes, 
that is exactly what XML is about.  That's why some people keep 
insisting that XML is "bits on the wire".


Tom P

Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning Books)
