[XML-SIG] Re: Parsing a unicode string

Daniel Veillard veillard at redhat.com
Tue Oct 5 23:16:12 CEST 2004

On Tue, Oct 05, 2004 at 01:10:16PM -0600, Mike Brown wrote:
> Fredrik Lundh wrote:
> > > I'd also expect parsers to accept unicode string objects with no encoding specification 
> > > whatsoever. Decoding a Unicode encoding and parsing XML are two distinct steps
> > 
> > not really; XML is defined in terms of encoded bytestreams.
> To clarify for Konrad's benefit -
> XML syntax is defined in terms of ISO/IEC 10646 characters.
> XML parsing is defined in terms of encoded byte streams.
> If the XML spec weren't so strict about what a parser must do, it would be 
> able to operate on pre-decoded streams. But as it is, the lowest-level parser 
> must play dumb, and any Unicode-friendliness must be provided by a higher 
> layer.

  Actually it should not be a problem. Appendix F.2 of the XML 1.0 spec says:
 "The second possible case occurs when the XML entity is accompanied by
  encoding information, as in some file systems and some network protocols."

This is typically the case when your environment tells you that the data is
available in a given encoding (usually UCS-4, UCS-2 or UTF-8).
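
To illustrate externally supplied encoding information, here is a minimal
Python sketch. It assumes the caller already holds decoded text, re-encodes
it as UTF-8, and hands that encoding to the parser so that it takes priority
over any encoding declaration inside the document. The helper name
parse_decoded is hypothetical; XMLParser and its encoding argument come from
Python's xml.etree.ElementTree module, not from libxml2.

```python
import xml.etree.ElementTree as ET


def parse_decoded(text):
    """Parse already-decoded text by supplying the encoding out of band.

    Hypothetical helper: the text is re-encoded as UTF-8 and the parser is
    told so explicitly, which overrides any encoding declaration found in
    the document itself.
    """
    data = text.encode("utf-8")
    parser = ET.XMLParser(encoding="utf-8")  # external encoding information
    parser.feed(data)
    return parser.close()  # root element of the parsed document


# The declaration claims ISO-8859-1, but the environment knows better.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'
root = parse_decoded(doc)
```

Here the calling API plays the role of the "higher-level protocol": it
decides that its own knowledge of the encoding wins over the declaration.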

 "When multiple sources of information are available, their relative
  priority and the preferred method of handling conflict should be specified
  as part of the higher-level protocol used to deliver XML." 

One could argue that the internal API is a very high-level protocol,
or simply rely on
 "If an XML entity is in a file, the Byte-Order Mark and encoding
  declaration are used (if present) to determine the character encoding."

and in the case of strings, if a BOM is present it tells the parser the right
way to decode the data, at the parser level.
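
A rough Python sketch of that BOM check, assuming only the UTF-8, UTF-16 and
UTF-32 BOMs matter and that UTF-8 is the fallback when none is found. The
names sniff_bom and decode_with_bom are hypothetical and this is not how
libxml2 implements it; the BOM constants are from Python's codecs module.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Strip the BOM, if any, and decode using the encoding it indicates."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

For example, decode_with_bom(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le"))
gives back the text "<a/>" with the BOM removed.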


Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
