[XML-SIG] Re: Parsing a unicode string
Daniel Veillard
veillard at redhat.com
Tue Oct 5 23:16:12 CEST 2004
On Tue, Oct 05, 2004 at 01:10:16PM -0600, Mike Brown wrote:
> Fredrik Lundh wrote:
> > > I'd also expect parsers to accept unicode string objects with no encoding specification
> > > whatsoever. Decoding a Unicode encoding and parsing XML are two distinct steps
> >
> > not really; XML is defined in terms of encoded bytestreams.
>
> To clarify for Konrad's benefit -
>
> XML syntax is defined in terms of ISO/IEC 10646 characters.
> XML parsing is defined in terms of encoded byte streams.
>
> If the XML spec weren't so strict about what a parser must do, it would be
> able to operate on pre-decoded streams. But as it is, the lowest-level parser
> must play dumb, and any Unicode-friendliness must be provided by a higher
> layer.
Actually, it should not be a problem:
http://www.w3.org/TR/REC-xml/#sec-guessing-with-ext-info
"The second possible case occurs when the XML entity is accompanied by
encoding information, as in some file systems and some network protocols."
This is typically the case when your environment tells you the data is
available in a given encoding (usually UCS-4, UCS-2, or UTF-8).
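
As a rough illustration of externally supplied encoding information, here is a minimal Python sketch using the standard library's expat binding. It assumes the caller already knows the bytes are UTF-8, and passes that to `xml.parsers.expat.ParserCreate`, whose `encoding` argument overrides the document's own declaration. The helper name `parse_text` and the sample document are made up for the example.

```python
import xml.parsers.expat


def parse_text(data, encoding):
    """Return the character data of `data`, trusting the external `encoding`."""
    chunks = []
    # The encoding given here takes priority over any encoding declaration
    # found inside the document itself.
    parser = xml.parsers.expat.ParserCreate(encoding)
    # expat may deliver character data in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# Hypothetical input: the declaration claims ISO-8859-1, but the environment
# knows the bytes were actually produced as UTF-8.
declared_latin1 = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'
data = declared_latin1.encode("utf-8")

result = parse_text(data, "utf-8")
```

Here the calling code plays the role of the "higher-level protocol": it decides that its own knowledge of the encoding wins over the declaration.
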
"When multiple sources of information are available, their relative
priority and the preferred method of handling conflict should be specified
as part of the higher-level protocol used to deliver XML."
One could argue that the internal API is a very high-level protocol,
or simply rely on:
"If an XML entity is in a file, the Byte-Order Mark and encoding
declaration are used (if present) to determine the character encoding."
In the case of strings, the BOM, if present, tells the parser which
encoding to use to decode the data.
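
A minimal sketch of that BOM check in Python follows. It only looks at the Byte-Order Mark and falls back to a caller-supplied default; it does not read the encoding declaration, so it is not the full detection procedure from the XML spec. The helper name `decode_by_bom` and the default of UTF-8 are assumptions for the example.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_by_bom(data, default="utf-8"):
    """Decode `data` using its BOM if one is present, else `default`."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not end up in the decoded text.
            return data[len(bom):].decode(encoding)
    return data.decode(default)


sample = "<doc>caf\u00e9</doc>"
decoded = decode_by_bom(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
```
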
Daniel
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard at redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/