[XML-SIG] how to get the 'codepage' from a xml document

Mike Brown mike@skew.org
Fri, 10 Jan 2003 11:04:15 -0700 (MST)

Remy C. Cool wrote:
> My application appends/inserts data into an existing xml file ... somewhat
> like a print queue. So I need the encoding to be able to create 
> 'the new' xml file in the same encoding as the original and I don't 
> like to hardcode the encoding into the source.

Ah, that makes sense, then. Yes, the only way you're going to be able to do
that is to peek at the byte stream yourself and work out the encoding from the
byte order mark, if there is one, and the encoding declaration. That works if
you're absolutely certain that the given encoding declaration is accurate, and
you're just dealing with XML in static files on disk or strings in memory. If
you've got a non-rewindable stream, you'll have to buffer the bytes you peeked
at so the parser still gets to see them.
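
Here is a rough sketch of that kind of peeking, in Python. It is only an
illustration, not a full implementation of the detection rules in the XML
spec: the function name sniff_xml_encoding and the "utf-8" fallback are my own
choices, and it simply trusts whatever the declaration says.

```python
import codecs
import re

# Byte order marks, with UTF-32 listed before UTF-16 because the UTF-32 LE
# mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Matches the encoding pseudo-attribute of an XML declaration that is written
# in an ASCII-compatible encoding and sits at the very start of the data.
_DECL = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head, default="utf-8"):
    """Guess the encoding from the first bytes of an XML document.

    `head` is a bytes object holding the start of the document. The result
    is a lowercase encoding name; `default` is returned when there is neither
    a byte order mark nor an encoding declaration.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # UTF-16 without a byte order mark: "<?" shows up with interleaved NULs.
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

For a file on disk you would open it in binary mode, pass the first couple
hundred bytes to the function, and then reopen or seek back before parsing.
For a non-rewindable stream, hang on to the bytes you read and feed them to
the parser ahead of the rest.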

> Would be nice to be able to access this kind of information with a 
> class (like ContentHandler) in sax.

It's the parser's job to get rid of all the lexical info, because according to
how XML is defined, the application is only supposed to be concerned with the
logical structures: the hierarchy of elements, attributes, character data,
processing instructions, all Unicode-based. How the document was encoded, what
the tags looked like, what kind of quotes delimited attribute values, how the 
doc was split up into entities and entity references, extraneous whitespace,
etc. is all considered lexical fluff. For the most part, SAX only concerns 
itself with providing the application with the logical data, which is exactly 
what a parser is supposed to do.

> Another solution to this problem would be to create only Unicode XML 
> files, but then there's the problem that not 'all' text editors 
> understand unicode.  

This is a perennial problem with XML. Parsers are only required to support
UTF-8 and UTF-16. Editors that don't understand UTF-8 or UTF-16 are only going
to understand things like iso-8859-1 or a platform default like windows-125x
(1252, etc.), so if you encode XML with one of those text editors in mind, you
might be making the document unparsable in some parsers. In real life, though,
I think most parsers do support iso-8859-1.
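
If you want to check a particular parser, a quick experiment is enough. The
sketch below assumes Python's bundled xml.dom.minidom (which sits on top of
expat); the helper name parse_text and the sample string are made up for the
test. It encodes the same document both ways and confirms the application
sees identical Unicode text either way.

```python
from xml.dom import minidom

SAMPLE = "caf\u00e9"  # contains one non-ASCII character


def parse_text(encoding):
    """Encode a tiny document in `encoding`, parse it, return its text."""
    source = '<?xml version="1.0" encoding="%s"?><doc>%s</doc>' % (
        encoding,
        SAMPLE,
    )
    raw = source.encode(encoding)  # the bytes differ per encoding
    doc = minidom.parseString(raw)
    return doc.documentElement.firstChild.data


# Both encodings should yield the same logical character data.
results = {enc: parse_text(enc) for enc in ("utf-8", "iso-8859-1")}
```

A parser that lacks iso-8859-1 support would raise an error on the second
document instead of returning the text.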


  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/