[XML-SIG] Re: Parsing a unicode string
Mike Brown
mike at skew.org
Tue Oct 5 21:10:16 CEST 2004
Fredrik Lundh wrote:
> > I'd also expect parsers to accept unicode string objects with no encoding specification
> > whatsoever. Decoding a Unicode encoding and parsing XML are two distinct steps
>
> not really; XML is defined in terms of encoded bytestreams.
To clarify for Konrad's benefit:
XML syntax is defined in terms of ISO/IEC 10646 characters.
XML parsing is defined in terms of encoded byte streams.
If the XML spec weren't so strict about what a parser must do, a parser would
be able to operate on pre-decoded streams. But as it is, the lowest-level parser
must play dumb, and any Unicode-friendliness must be provided by a higher
layer. SAX, for example, does accept Unicode character streams as entities and
specifies that any encoding declaration appearing in the stream will be
ignored, which is technically a violation of a couple of rules, e.g. that
the declaration must be accurate :)
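A rough illustration of the two layers, as a sketch only. It assumes Python 3 and the standard library's xml.etree.ElementTree (not SAX itself), whose expat-based parser, as far as I know, treats an already-decoded str as UTF-8 internally and so disregards the declared encoding. The document content and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # hypothetical content with one non-ASCII character

# Case 1: an encoded byte stream. The parser reads the encoding
# declaration and decodes the bytes itself.
doc_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><word>' + TEXT + "</word>"
).encode("iso-8859-1")
from_bytes = ET.fromstring(doc_bytes).text

# Case 2: the same document, decoded by a higher layer first. The
# declaration still says iso-8859-1, which no longer describes the input;
# the parser is expected to ignore it rather than decode a second time.
doc_str = doc_bytes.decode("iso-8859-1")
from_str = ET.fromstring(doc_str).text

print(from_bytes == TEXT, from_str == TEXT)
```

Other parsers may choose differently here; some refuse a decoded string that still carries an encoding declaration, which is arguably the stricter reading of the spec.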