[XML-SIG] Re: Parsing a unicode string
Mike Brown
mike at skew.org
Wed Oct 6 01:46:07 CEST 2004
Daniel Veillard wrote:
> > If the XML spec weren't so strict about what a parser must do, it would be
> > able to operate on pre-decoded streams. But as it is, the lowest-level parser
> > must play dumb, and any Unicode-friendliness must be provided by a higher
> > layer.
>
> Actually it should not be a problem:
> http://www.w3.org/TR/REC-xml/#sec-guessing-with-ext-info
>
> "The second possible case occurs when the XML entity is accompanied by
> encoding information, as in some file systems and some network protocols."
>
> this is typically the case if your environment tells you the data is
> available in a given encoding (UCS4/UCS2/UTF-8 usually)

Oh, I'm sure that's still talking about parsing bytes, though. "It's already
decoded to a character sequence / encoding does not apply" is not the kind of
external encoding information that they are talking about there, or anywhere
else. That would be a very liberal reading of the spec, I think.

But perhaps this is a discussion for xml-dev. I'm not about to rejoin that
forum, though. As I just told someone else, xml-dev, to me, was too many
engineers coming up with too many solutions to too many problems that they,
themselves, have willed into existence. :)

Nevertheless, I think it would be a good idea for all of Python's XML parsing
APIs to accept externally supplied encoding information that overrides
whatever the document itself declares. Then it would at least be possible to
blindly encode a Unicode string to whatever your favorite encoding is and
notify the parser accordingly.
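
A minimal sketch of the "encode blindly, then tell the parser" idea, using the
encoding argument of the standard library's xml.parsers.expat.ParserCreate,
which overrides the document's own declaration. This is written for a current
Python 3 rather than the Python of this thread, and the helper name
parse_unicode and the sample document are made up for illustration:

```python
import xml.parsers.expat

def parse_unicode(text):
    """Parse an already-decoded XML string and return its character data.

    The string is encoded to UTF-8 regardless of what its XML declaration
    says, and the parser is told to treat the bytes as UTF-8.
    """
    data = text.encode("utf-8")
    # The encoding argument overrides the implicit or declared encoding.
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    chunks = []
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)  # True marks this as the final chunk
    return "".join(chunks)

# The declaration claims iso-8859-1, but the bytes handed over are UTF-8.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
result = parse_unicode(doc)
```

Without the override, the parser would trust the stale iso-8859-1 declaration
and misread the two UTF-8 bytes of the accented character.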

Like I said, this functionality only went into 4Suite a few months ago [1],
and I went a bit out of my way to make it properly use the encoding
information from HTTP streams and to follow the rules of RFCs 3023 and 2616.
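
For illustration only (this is not 4Suite's code), a rough sketch of deriving
the external encoding from an HTTP Content-Type header along the lines of RFC
3023: an explicit charset parameter wins, text/xml with no charset falls back
to us-ascii, and application/xml with no charset leaves detection to the
parser. The function name external_encoding is hypothetical, and the sketch
ignores the finer points of both RFCs:

```python
from email.message import Message

# Media types whose missing charset defaults to us-ascii under RFC 3023.
TEXT_XML_TYPES = ("text/xml", "text/xml-external-parsed-entity")

def external_encoding(content_type):
    """Return the encoding implied by a Content-Type value, or None.

    None means "no external information": the parser should fall back to
    the BOM and XML declaration in the entity itself.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased, or None if absent
    if charset is not None:
        return charset
    if msg.get_content_type() in TEXT_XML_TYPES:
        return "us-ascii"
    return None

explicit = external_encoding('text/xml; charset="UTF-8"')
implied = external_encoding("text/xml")
unknown = external_encoding("application/xml")
```

The returned name could then be handed to whatever override hook the parser
offers, with None meaning no override at all.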
-Mike
[1] documented here:
http://uche.ogbuji.net/tech/akara/nodes/2004-06-12/external-encoding