[XML-SIG] Re: Parsing a unicode string

Mike Brown mike at skew.org
Wed Oct 6 01:46:07 CEST 2004

Daniel Veillard wrote:
> > If the XML spec weren't so strict about what a parser must do, it would be 
> > able to operate on pre-decoded streams. But as it is, the lowest-level parser 
> > must play dumb, and any Unicode-friendliness must be provided by a higher 
> > layer.
>   Actually it should not be a problem:
>    http://www.w3.org/TR/REC-xml/#sec-guessing-with-ext-info
>  "The second possible case occurs when the XML entity is accompanied by
>   encoding information, as in some file systems and some network protocols."
> this is typically the case if your environment tells you the data is 
> available in a given encoding (UCS4/UCS2/UTF-8 usually) 

Oh, I'm sure that's still talking about parsing bytes, though. "It's already 
decoded to a character sequence / encoding does not apply" is not the kind of 
external encoding information that they are talking about there, or anywhere 
else. That would be a very liberal reading of the spec, I think.

But perhaps this is a discussion for xml-dev. I'm not about to rejoin that 
forum, though. As I just told someone else, xml-dev, to me, was too many 
engineers coming up with too many solutions to too many problems that they, 
themselves, have willed into existence. :)

Nevertheless, I think it would be a good idea for all of Python's XML parsing 
APIs to support external encoding declarations, so it would at least be 
possible to blindly encode to whatever your favorite encoding is and then 
notify the parser accordingly.
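
As a minimal sketch of that "encode blindly, then notify the parser" approach, here is how it can look with the expat binding in Python's standard library, whose `ParserCreate` takes an encoding argument that overrides whatever the document itself declares. The helper name `parse_unicode` and the sample document are made up for illustration; this is not how any particular toolkit implements it.

```python
import xml.parsers.expat

def parse_unicode(text, encoding="utf-8"):
    """Encode an already-decoded string and parse the resulting bytes,
    passing the encoding to expat as external information."""
    chunks = []
    # The encoding given here takes precedence over the encoding
    # pseudo-attribute in the XML declaration, if there is one.
    parser = xml.parsers.expat.ParserCreate(encoding)
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text.encode(encoding), True)
    return "".join(chunks)

# The declaration claims iso-8859-1, but the bytes handed over are UTF-8.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'
result = parse_unicode(doc)
```

The stale declaration inside the string is harmless here only because the external encoding wins; without the override, the parser would decode the UTF-8 bytes as ISO-8859-1.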

Like I said, this functionality only went into 4Suite a few months ago [1], 
and I went a bit out of my way to make it properly use the encoding 
information from HTTP streams and to follow the rules of RFCs 3023 and 2616.
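
For illustration only, the core of the RFC 3023 rule for HTTP can be sketched as below: an explicit `charset` parameter on the Content-Type header is authoritative, `text/xml` without one defaults to us-ascii, and `application/xml` without one leaves the parser to detect the encoding from the entity itself. The function name `external_encoding` is hypothetical, and this is not 4Suite's code; it skips many details (for example `+xml` media types).

```python
from email.message import Message

def external_encoding(content_type):
    """Return the encoding implied by a Content-Type header value,
    or None if the parser should detect it from the bytes."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    charset = msg.get_content_charset()  # lowercased, or None if absent
    if charset is not None:
        # An explicit charset parameter is authoritative.
        return charset
    if media_type in ("text/xml", "text/xml-external-parsed-entity"):
        # RFC 3023: text/xml without a charset defaults to us-ascii.
        return "us-ascii"
    # application/xml and others: fall back to the XML spec's own detection.
    return None

explicit = external_encoding("text/xml; charset=UTF-8")
defaulted = external_encoding("text/xml")
undetermined = external_encoding("application/xml")
```

The returned value, when not None, is what would then be passed to the parser as the external encoding.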


[1] documented here:
